Predictive analytics: Obliterated

Thoughts

Questions

  • What deficit are we talking about? Is it covered by the mRS?
  • Should the pre-treatment mRS be in the model, or is it already covered by the other variables?

Backlog

  • Honest estimation
  • G-estimation (refit using only the significant predictors to obtain accurate ORs)
  • Verify that the number of outcome events (26) is sufficient
  • Bootstrap the whole process: in how many bootstrap models is each feature selected?
  • The SuSiE method: https://www.biorxiv.org/content/10.1101/501114v1
  • Choose the best model with forward/backward stepwise selection
  • Feature selection with honest estimation + p-value-based + bootstrap
  • Feature selection with honest estimation + LASSO-based + bootstrap
  • Use PCA to create independent components rather than dropping collinear variables
  • Bayesian fitting
  • Switch to tidybayes and tidymodels
  • Fit with neural networks
  • Consider a time-to-event analysis
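The bootstrap item above can be sketched in a few lines: refit a stepwise-selected logistic model on bootstrap resamples and count how often each feature survives. This is a minimal sketch on simulated data (the variable names and effect sizes are invented for illustration; the real analysis would resample df_multi and refit against the chosen outcome).

```r
# Bootstrap selection stability on simulated data (illustrative only).
set.seed(1891)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(1.5 * sim$x1 - 0.2 * sim$x3))

n_boot <- 50
selected <- matrix(
  FALSE, n_boot, 3, dimnames = list(NULL, c("x1", "x2", "x3"))
)
for (b in seq_len(n_boot)) {
  boot <- sim[sample(nrow(sim), replace = TRUE), ]
  # Backward AIC selection on the resample
  fit <- step(
    glm(y ~ x1 + x2 + x3, data = boot, family = binomial()),
    direction = "backward", trace = 0
  )
  selected[b, ] <- colnames(selected) %in% names(coef(fit))
}
colMeans(selected)  # proportion of bootstrap models retaining each feature
```

A feature retained in, say, more than 80% of resamples is a much more stable signal than a single significant p-value from one fitted model.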

Setup

Imports

# Load packages
library(glmnet)
library(magrittr)
library(selectiveInference)
library(tidyverse)

# Data
# https://office365stanford-my.sharepoint.com/:x:/r/personal/simonlev_stanford_edu/_layouts/15/Doc.aspx?sourcedoc=%7BCE7DAF19-8F1A-43B0-866D-3D0F36336EDC%7D&file=Peds%20AVM%20Database%20July%2010%202024%20Onwards.xlsx&fromShare=true&action=default&mobileredirect=true

# Source code

Configurations

# Source outcome-specific configs
source(params$CONFIG)

# Destination locations
DST_DIRNAME <- "../../outputs/"

# Source locations
SRC_DIRNAME <- "../../data/3_tidy/patients"
SRC_BASENAME <- "patients-daily_2024-07-21.csv"

# Should the Dockerfile be automatically updated?
UPDATE_DOCKERFILE <- FALSE

# Set the seed
set.seed(1891)

Parameters

EXPOSURES_CONTINUOUS <- c(
  "Age at 1st treatment (years)" = "age_at_first_treatment_yrs",
  "mRS (presentation)" = "modified_rankin_score_presentation",
  "mRS (pre-treatment)" = "modified_rankin_score_pretreatment",
  "mRS (1-week post-op)" = "modified_rankin_score_postop_within_1_week",
  "mRS (final)" = "modified_rankin_score_final",
  "Nidus size (cm)" = "max_size_cm",
  "Spetzler-Martin grade" = "spetzler_martin_grade",
  "Lawton-Young grade" = "lawton_young_grade",
  "Size score" = "size_score"
)

EXPOSURES_BINARY <- c(
  "Sex (Male)" = "is_male",
  "Involves eloquent location" = "is_eloquent_location",
  "Has deep venous drainage" = "has_deep_venous_drainage",
  "Diffuse nidus" = "is_diffuse_nidus",
  "Hemorrhage at presentation" = "has_hemorrhage",
  "Seizures at presentation" = "has_seizures",
  "Deficit at presentation" = "has_deficit",
  "Paresis at presentation" = "has_paresis",
  "Presence of aneurysms" = "has_associated_aneurysm",
  "Spetzler-Martin grade < 4" = "is_spetzler_martin_grade_less_than_4",
  "Had surgery" = "is_surgery"
)

EXPOSURES_CATEGORICAL <- c(
  "Location" = "location",
  "Venous drainage" = "venous_drainage",
  "Modality of treatment" = "procedure_combinations"
)

OUTCOMES <- c(
  "Poor outcome (mRS >= 3)" = "is_poor_outcome",
  "Obliteration" = "is_obliterated",
  "Complications - minor" = "has_complication_minor",
  "Complications - major" = "has_complication_major",
  "mRS change (final - pre-treatment)" =
    "modified_rankin_score_final_minus_pretx",
  "mRS change (final - presentation)" =
    "modified_rankin_score_final_minus_presentation",
  "mRS change direction (Final - Pre-tx)" = "change_in_mrs_tx_vs_final"
)

Outputs

Create lists.

plots <- list()
tables <- list()
models <- list()

Functions

R utils.

Data analysis utils.


Read

# File path
filepath <- file.path(SRC_DIRNAME, SRC_BASENAME)

# Read
patients_ <-
  read_csv(filepath) %>%
  dplyr::sample_frac(params$PROPORTION_OF_DATA)

# If analyzing a subset of the data, restrict the dataset
if (params$TITLE == "Poor outcome - Hemorrhage") {
  patients_ <- patients_ %>% filter(has_hemorrhage)
}
if (params$TITLE == "Poor outcome - No Hemorrhage") {
  patients_ <- patients_ %>% filter(!has_hemorrhage)
}

Conform

Setup

Remove rows that lack a patient ID, then sort by patient ID.

patients_ <-
  patients_ %>%
  filter(!is.na(patient_id)) %>%
  arrange(patient_id)

Create two dataframes: one for univariable and one for multivariable analysis. The multivariable variables will be recoded to fewer levels to avoid overfitting the model.

df_uni <-
  patients_ %>%
  filter(is_eligible) %>% 
  filter(!is.na(spetzler_martin_grade))

df_multi <-
  patients_ %>%
  filter(is_eligible) %>% 
  filter(!is.na(spetzler_martin_grade))
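An events-per-variable (EPV) check helps motivate the level reduction that follows: a common rule of thumb asks for roughly 10 outcome events per candidate model parameter. A minimal sketch (the parameter count here is hypothetical; the event count echoes the backlog's 26):

```r
# Events-per-variable (EPV) sanity check. The >= 10 events-per-parameter
# rule of thumb is a heuristic, not a hard law; counts are illustrative.
epv_check <- function(n_events, n_params, min_epv = 10) {
  epv <- n_events / n_params
  list(epv = epv, ok = epv >= min_epv)
}
epv_check(n_events = 26, n_params = 5)
```

With 26 events, even a 5-parameter model sits at EPV ≈ 5, which is why collapsing categorical levels before multivariable modeling matters here.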

Recode columns

For venous_drainage, recode "Both" as "Deep," since any deep component counts as deep drainage.

df_multi <-
  df_multi %>%
  mutate(
    venous_drainage = ifelse(venous_drainage == "Both", "Deep", venous_drainage)
  )

For procedure_combinations, recode any multi-character code (i.e., more than one modality) as "Multimodal" to reduce the number of levels.

df_multi <-
  df_multi %>%
  mutate(procedure_combinations = case_when(
    nchar(procedure_combinations) > 1 ~ "Multimodal",
    .default = procedure_combinations
  ))

For procedure_combinations, flag whether the treatment was surgery-based (the code contains "S").

df_multi <-
  df_multi %>%
  mutate(
    is_surgery = str_detect(procedure_combinations, "S")
  ) %>%
  relocate(is_surgery, .after = procedure_combinations)

df_uni <-
  df_uni %>%
  mutate(
    is_surgery = str_detect(procedure_combinations, "S")
  ) %>%
  relocate(is_surgery, .after = procedure_combinations)

For location, collapse infrequent levels into "Other" and set "Other" as the reference level.

df_multi <-
  df_multi %>%
  mutate(
    location = ifelse(location == "Corpus Callosum", "Other", location),
    location = ifelse(location == "Cerebellum", "Other", location),
    # location = ifelse(location == "Deep", "Other", location),
    location = factor(location),
    location = relevel(location, ref = "Other")
  )
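The manual ifelse() chain works, but collapsing by frequency generalizes better if the data change. A hedged base-R sketch (the min_n = 5 threshold and helper name are assumptions, not from the analysis; forcats::fct_lump_min offers the same idea in the tidyverse):

```r
# Collapse any level observed fewer than min_n times into "Other",
# and make "Other" the reference level when it exists.
collapse_rare <- function(x, min_n = 5, other = "Other") {
  counts <- table(x)
  rare <- names(counts)[counts < min_n]
  x <- factor(ifelse(x %in% rare, other, as.character(x)))
  if (other %in% levels(x)) x <- relevel(x, ref = other)
  x
}

# Toy example with invented counts
loc <- c(rep("Frontal", 10), rep("Cerebellum", 2), rep("Corpus Callosum", 1))
levels(collapse_rare(loc))
```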

For spetzler_martin_grade, create a binary variant of 1-3 vs. 4-5.

# For multivariable analysis
df_multi <-
  df_multi %>%
  mutate(
    # Divides into low grade vs. high grade
    is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
  ) %>%
  relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)

# For univariable analysis
df_uni <-
  df_uni %>%
  mutate(
    is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
  ) %>%
  relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)

Missingness

# Get cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  OUTCOMES
))

# Count missingness
df_multi %>%
  select(all_of(cols)) %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), values_to = "num_missing") %>%
  arrange(desc(num_missing)) %>%
  filter(num_missing > 0) %>%
  sable()
name num_missing
modified_rankin_score_pretreatment 2
modified_rankin_score_postop_within_1_week 2
lawton_young_grade 2
has_associated_aneurysm 2
is_surgery 2
procedure_combinations 2
modified_rankin_score_final_minus_pretx 2
is_male 1
is_diffuse_nidus 1
location 1
has_complication_minor 1
has_complication_major 1

Which eligible patients are missing each variable?

df_multi %>%
  filter(is_eligible) %>%
  select(mrn, all_of(cols)) %>%
  mutate(across(-mrn, is.na)) %>%
  pivot_longer(
    cols = -mrn,
    names_to = "name",
    values_to = "is_missing"
  ) %>%
  filter(is_missing) %>%
  group_by(name) %>%
  summarize(mrn = str_c(mrn, collapse = ", ")) %>%
  sable()
name mrn
has_associated_aneurysm 60310463, 61011961
has_complication_major 62023254
has_complication_minor 62023254
is_diffuse_nidus 15871379
is_male 19399469
is_surgery 49659493, 46166047
lawton_young_grade 60310463, 61011961
location 60310463
modified_rankin_score_final_minus_pretx 49659493, 46166047
modified_rankin_score_postop_within_1_week 49659493, 46166047
modified_rankin_score_pretreatment 49659493, 46166047
procedure_combinations 49659493, 46166047
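The two missingness tables above show at most two missing values per variable, but complete-case analysis drops a row if any modeled variable is missing, so the losses compound across columns. A toy illustration (invented data, not the registry):

```r
# Listwise deletion compounds across columns: each column has only one
# missing value, yet only two rows are fully complete.
toy <- data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c(1, 2, NA, 4, 5),
  c = c(NA, 2, 3, 4, 5)
)
colSums(is.na(toy))  # one missing per column...
nrow(na.omit(toy))   # ...but only 2 of 5 rows survive na.omit()
```

When the per-column missingness is this small, listing the affected patients (as above) and deciding variable by variable is usually preferable to blanket deletion.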

Visualizations

Overall

Distribution of values of the exposures across levels of the outcome.

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(-all_of(CHOSEN_OUTCOME), names_to = "predictor", values_to = "value")

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = .data[[CHOSEN_OUTCOME]],
    color = .data[[CHOSEN_OUTCOME]]
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", fill = "Outcome") +
  theme_minimal() +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11),
    legend.position = "none"
  )

By hemorrhage

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(
    cols = -c(all_of(CHOSEN_OUTCOME), `Hemorrhage at presentation`),
    names_to = "predictor",
    values_to = "value"
  )

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = `Hemorrhage at presentation`,
    color = `Hemorrhage at presentation`
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", color = "Hemorrhage at presentation") +
  theme_minimal() +
  guides(fill = "none") +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11)
  )

By SM grade high vs. low

# Define the predictors of interest
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  CHOSEN_OUTCOME
)

# Reshape the data
df_long <-
  df_uni %>%
  select(all_of(cols)) %>%
  pivot_longer(
    cols = -c(all_of(CHOSEN_OUTCOME), `Spetzler-Martin grade < 4`),
    names_to = "predictor",
    values_to = "value"
  )

# Plot the box plots
df_long %>%
  ggplot(aes(
    x = .data[[CHOSEN_OUTCOME]],
    y = value,
    fill = `Spetzler-Martin grade < 4`,
    color = `Spetzler-Martin grade < 4`
  )) +
  geom_boxplot(alpha = 0.5) +
  facet_wrap(~predictor, scales = "free") +
  labs(x = "Outcome", y = "Value", color = "Spetzler-Martin grade < 4") +
  theme_minimal() +
  guides(fill = "none") +
  theme(
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11),
    strip.text = element_text(size = 11)
  )


Descriptive statistics

Setup

Descriptive statistics - continuous variables.

compute_descriptive_continuous <- function(
    df = df_uni,
    cols = c(EXPOSURES_CONTINUOUS),
    use_interaction = TRUE,
    interaction_col = "has_hemorrhage",
    interaction_col_name = "Hemorrhage at presentation",
    interaction_col_value = TRUE,
    prefix = "Haem") {
  # Apply desired formatting to numbers
  format_numbers <- function(x) {
    sprintf("%.1f", x)
  }

  # Compute new dataset
  df <-
    df %>%
    select(all_of(cols), all_of(CHOSEN_OUTCOME), all_of(interaction_col)) %>%
    drop_na(all_of(CHOSEN_OUTCOME), all_of(interaction_col))

  # Apply interaction
  if (use_interaction) {
    df <-
      df %>%
      filter(!!sym(interaction_col) == interaction_col_value)
  }

  # Compute descriptive statistics
  desc_stats <-
    df %>%
    pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
    group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
    summarize(
      num_total = n(),
      num_missing = sum(is.na(values)),
      mean = mean(values, na.rm = TRUE),
      sd = sd(values, na.rm = TRUE),
      min = quantile(values, 0, na.rm = TRUE),
      max = quantile(values, 1, na.rm = TRUE),
      q_50 = quantile(values, 0.50, na.rm = TRUE),
      q_25 = quantile(values, 0.25, na.rm = TRUE),
      q_75 = quantile(values, 0.75, na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      stats = glue::glue("{format_numbers(mean)} ({format_numbers(sd)})"),
      !!sym(CHOSEN_OUTCOME) := factor(
        !!sym(CHOSEN_OUTCOME),
        levels = CHOSEN_OUTCOME_LEVELS,
        labels = CHOSEN_OUTCOME_LABELS
      )
    ) %>%
    pivot_wider(
      id_cols = keys,
      names_from = !!sym(CHOSEN_OUTCOME),
      values_from = stats
    ) %>%
    relocate(all_of(CHOSEN_OUTCOME_LABELS[2]), .after = all_of(CHOSEN_OUTCOME_LABELS[1]))

  # Compute p-values
  pvals <-
    df %>%
    pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
    group_by(keys) %>%
    summarize(
      pvalue = wilcox.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
    )

  # Synthesize
  out <-
    desc_stats %>%
    left_join(pvals, by = "keys")

  # Add sample size to column names
  num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
  num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))

  new_col_names <- c(
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
    glue::glue("{prefix} - P-value")
  )

  out <-
    out %>%
    rename_with(
      .cols = !!sym(CHOSEN_OUTCOME_LABELS[1]):pvalue, ~new_col_names
    ) %>%
    mutate(
      keys = ifelse(keys == interaction_col, interaction_col_name, keys)
    )

  return(out)
}

Descriptive statistics - binary variables.

compute_descriptive_binary <- function(
    df = df_uni,
    cols = c(EXPOSURES_BINARY),
    use_interaction = TRUE,
    interaction_col = "has_hemorrhage",
    interaction_col_name = "Hemorrhage at presentation",
    interaction_col_value = TRUE,
    prefix = "Haem") {
  # Define arguments
  cols <- cols[cols != interaction_col]

  # Apply desired formatting to numbers
  format_numbers <- function(x, decimals = 0) {
    decimals <- paste0("%.", decimals, "f")
    sprintf(decimals, x)
  }

  # Compute new dataset
  df <-
    df %>%
    select(all_of(cols), all_of(CHOSEN_OUTCOME), all_of(interaction_col)) %>%
    drop_na(all_of(CHOSEN_OUTCOME), all_of(interaction_col))

  # Apply interaction
  if (use_interaction) {
    df <-
      df %>%
      filter(!!sym(interaction_col) == interaction_col_value)
  }

  # Compute descriptive statistics
  desc_stats <-
    df %>%
    pivot_longer(
      cols = -all_of(CHOSEN_OUTCOME),
      names_to = "keys",
      values_to = "values"
    ) %>%
    group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
    summarize(
      num_total = n(),
      num_missing = sum(is.na(values)),
      num_with = sum(values, na.rm = TRUE),
      num_without = sum(!values, na.rm = TRUE),
      pct_with = mean(values, na.rm = TRUE)
    ) %>%
    ungroup() %>%
    mutate(
      stats = glue::glue("{num_with} ({format_numbers(pct_with*100)}%)"),
      !!sym(CHOSEN_OUTCOME) := factor(
        !!sym(CHOSEN_OUTCOME),
        levels = CHOSEN_OUTCOME_LEVELS,
        labels = CHOSEN_OUTCOME_LABELS
      )
    ) %>%
    pivot_wider(
      id_cols = keys,
      names_from = !!sym(CHOSEN_OUTCOME),
      values_from = stats
    ) %>%
    relocate(all_of(CHOSEN_OUTCOME_LABELS[1]), .after = all_of(CHOSEN_OUTCOME_LABELS[2]))

  # Compute p-values
  pvals <-
    df %>%
    pivot_longer(
      cols = -all_of(CHOSEN_OUTCOME),
      names_to = "keys",
      values_to = "values"
    ) %>%
    group_by(keys) %>%
    summarize(
      pvalue = kruskal.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
    )

  # Synthesize
  out <-
    desc_stats %>%
    left_join(pvals, by = "keys")

  # Prettify
  num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
  num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))

  new_col_names <- c(
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
    glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
    glue::glue("{prefix} - P-value")
  )

  out <-
    out %>%
    rename_with(
      .cols = !!sym(CHOSEN_OUTCOME_LABELS[2]):pvalue, ~new_col_names
    ) %>%
    mutate(
      keys = ifelse(keys == interaction_col, interaction_col_name, keys)
    )

  return(out)
}

Non-specific

# Rename dataframe
`Pediatric AVMs` <-
  df_uni %>%
  filter(is_eligible) %>%
  select(-is_eligible, -comments)

# Create summary
`Pediatric AVMs` %>%
  summarytools::dfSummary(display.labels = FALSE) %>%
  print(
    file =
      "../../outputs/descriptive-statistics/descriptive_statistics_eligible.html",
    footnote = NA
  )

# Remove unwanted dataframe
remove(`Pediatric AVMs`)

By hemorrhage - table1 (WIP)

https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html

library(table1)

table1::table1(
  ~ factor(is_male) + age_at_first_treatment_yrs +
    factor(is_eloquent_location) + max_size_cm | has_hemorrhage,
  data = df_uni
)

# Initialize values
df <- df_uni

# Remove missing values in stratification variables
df <-
  df %>%
  drop_na(has_hemorrhage, all_of(CHOSEN_OUTCOME))

# Define columns
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  OUTCOMES[OUTCOMES == CHOSEN_OUTCOME]
)

# Assign labels to the variables in the dataframe
for (var in cols) {
  label(df[[var]]) <- names(cols[cols == var])
}

# Assign meaningful labels to the has_hemorrhage variable
df$has_hemorrhage <- factor(
  df$has_hemorrhage,
  levels = c(FALSE, TRUE),
  labels = c("No Hemorrhage", "Has Hemorrhage")
)

df[, CHOSEN_OUTCOME] <- factor(
  df %>% pull(CHOSEN_OUTCOME),
  levels = CHOSEN_OUTCOME_LEVELS,
  labels = CHOSEN_OUTCOME_LABELS
)

# Define custom rendering functions if needed
render_continuous <- function(x) {
  with(stats.apply.rounding(stats.default(x), digits = 2), c(
    "Mean (SD)" = sprintf("%s (&plusmn; %s)", MEAN, SD),
    "Median (IQR)" = sprintf("%s (%s - %s)", MEDIAN, Q1, Q3)
  ))
}

compute_pvalues <- function(x, ...) {
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(seq_along(x), times = sapply(x, length)))
  if (is.numeric(y)) {
    # For numeric variables, perform a standard 2-sample t-test
    p <- t.test(y ~ g)$p.value
  } else {
    # For categorical variables, perform a chi-squared test of independence
    p <- chisq.test(table(y, g))$p.value
  }
  # Format the p-value, using an HTML entity for the less-than sign.
  # The initial empty string places the output on the line below the variable label.
  c("", sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001)))
}

# Create the descriptive table formula
fmla <- as.formula(paste("~ . | has_hemorrhage *", CHOSEN_OUTCOME))

# Create table
table1_descriptive_statistics <-
  table1(fmla,
    data = df[, unname(cols)],
    render.continuous = render_continuous,
    overall = c(left = "Overall"),
    topclass = "Rtable1-zebra"
  )

# Print
table1_descriptive_statistics

By hemorrhage

if (!str_detect(params$TITLE, "Hemorrhage")) {
  
  # Define arguments
  interaction_col <- "has_hemorrhage"
  interaction_col_name <- "Hemorrhage at presentation"
  prefix_true <- "Haem"
  prefix_false <- "No Haem"
  
  # Get all
  out <- list()
  out[[1]] <- compute_descriptive_continuous(
    use_interaction = FALSE,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = TRUE,
    prefix = "All"
  )
  out[[2]] <- compute_descriptive_binary(
    use_interaction = FALSE,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = TRUE,
    prefix = "All"
  )
  out_all <- bind_rows(out)

  # Get with hemorrhage
  out <- list()
  out[[1]] <- compute_descriptive_continuous(
    use_interaction = TRUE,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = TRUE,
    prefix = prefix_true
  )
  out[[2]] <- compute_descriptive_binary(
    use_interaction = TRUE,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = TRUE,
    prefix = prefix_true
  )
  out_haem <- bind_rows(out)

  # Get without hemorrhage
  out <- list()
  out[[1]] <- compute_descriptive_continuous(
    use_interaction = TRUE,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = FALSE,
    prefix = prefix_false
  )
  out[[2]] <- compute_descriptive_binary(
    use_interaction = TRUE,
    interaction_col = interaction_col,
    interaction_col_name = interaction_col_name,
    interaction_col_value = FALSE,
    prefix = prefix_false
  )
  out_no_haem <- bind_rows(out)

  # Synthesize
  out_complete <-
    out_all %>%
    left_join(out_haem, by = "keys") %>%
    left_join(out_no_haem, by = "keys") %>%
    arrange(`All - P-value`)
  
  # Print
  out_complete %>%
    sable()
  
}
keys All - Obliterated (n=178) All - Not obliterated (n=50) All - P-value Haem - Obliterated (n=117) Haem - Not obliterated (n=20) Haem - P-value No Haem - Obliterated (n=61) No Haem - Not obliterated (n=30) No Haem - P-value
Had surgery 123 (70%) 10 (20%) 0.00000 87 (76%) 3 (15%) 0.00000 36 (59%) 7 (23%) 0.00143
Lawton-Young grade 5.7 (1.5) 4.0 (1.4) 0.00000 5.2 (1.4) 3.7 (1.3) 0.00009 6.1 (1.4) 4.7 (1.4) 0.00004
Spetzler-Martin grade 3.8 (1.1) 2.6 (1.2) 0.00000 3.6 (1.0) 2.5 (1.1) 0.00020 3.9 (1.1) 2.8 (1.2) 0.00006
Nidus size (cm) 5.1 (3.0) 3.0 (1.8) 0.00000 4.3 (2.9) 2.8 (1.8) 0.02157 5.6 (3.0) 3.4 (1.8) 0.00006
Spetzler-Martin grade < 4 138 (78%) 21 (42%) 0.00000 95 (81%) 11 (55%) 0.00994 43 (70%) 10 (33%) 0.00078
Size score 2.1 (0.8) 1.6 (0.7) 0.00000 1.8 (0.8) 1.5 (0.6) 0.06775 2.4 (0.7) 1.7 (0.7) 0.00018
Involves eloquent location 95 (53%) 44 (88%) 0.00001 56 (48%) 20 (100%) 0.00002 39 (64%) 24 (80%) 0.12057
Diffuse nidus 15 (8%) 16 (32%) 0.00002 10 (9%) 6 (30%) 0.00595 5 (8%) 10 (33%) 0.00285
mRS (final) 1.8 (1.8) 0.8 (0.9) 0.00012 2.4 (1.9) 0.8 (0.9) 0.00019 1.5 (1.7) 0.7 (0.9) 0.02199
Hemorrhage at presentation 117 (66%) 20 (40%) 0.00105 117 (100%) 20 (100%) 0 (0%) 0 (0%)
Has deep venous drainage 90 (51%) 35 (70%) 0.01489 67 (57%) 15 (75%) 0.13628 23 (38%) 20 (67%) 0.00968
Deficit at presentation 59 (33%) 26 (52%) 0.01507 40 (34%) 11 (55%) 0.07626 19 (31%) 15 (50%) 0.08222
mRS (presentation) 1.8 (1.6) 2.4 (1.8) 0.01602 2.9 (1.8) 3.2 (1.7) 0.47279 1.1 (1.0) 1.1 (1.0) 0.95248
Presence of aneurysms 33 (19%) 13 (27%) 0.19296 27 (23%) 7 (35%) 0.25571 6 (10%) 6 (21%) 0.13927
Paresis at presentation 43 (24%) 15 (30%) 0.40296 31 (26%) 9 (45%) 0.09377 12 (20%) 6 (20%) 0.97072
mRS (pre-treatment) 1.5 (1.4) 1.6 (1.5) 0.49277 2.2 (1.6) 2.0 (1.6) 0.59646 1.0 (0.9) 1.0 (0.9) 0.84416
Sex (Male) 92 (52%) 28 (57%) 0.49892 63 (54%) 10 (50%) 0.75091 29 (48%) 18 (62%) 0.19975
Seizures at presentation 42 (24%) 14 (28%) 0.52355 22 (19%) 7 (35%) 0.10255 20 (33%) 7 (23%) 0.35604
Age at 1st treatment (years) 11.2 (3.7) 11.3 (3.9) 0.80199 11.3 (3.1) 10.9 (3.6) 0.65741 11.1 (4.1) 12.1 (4.3) 0.21395
mRS (1-week post-op) 1.8 (1.5) 1.6 (1.3) 0.82396 2.4 (1.6) 1.9 (1.3) 0.21666 1.4 (1.3) 1.2 (1.1) 0.75637

By diffuse nidus

# Define arguments
interaction_col <- "is_diffuse_nidus"
interaction_col_name <- "Is diffuse nidus"
prefix_true <- "Diffuse"
prefix_false <- "Not Diffuse"

# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = FALSE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = FALSE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = "All"
)
out_all <- bind_rows(out)

# Get with diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = prefix_true
)
out_diffuse <- bind_rows(out)

# Get without diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = FALSE,
  prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = FALSE,
  prefix = prefix_false
)
out_not_diffuse <- bind_rows(out)

# Synthesize
out_complete <-
  out_all %>%
  left_join(out_diffuse, by = "keys") %>%
  left_join(out_not_diffuse, by = "keys") %>%
  arrange(`All - P-value`)

# Print
out_complete %>%
  sable()
keys All - Obliterated (n=177) All - Not obliterated (n=50) All - P-value Diffuse - Obliterated (n=15) Diffuse - Not obliterated (n=16) Diffuse - P-value Not Diffuse - Obliterated (n=162) Not Diffuse - Not obliterated (n=34) Not Diffuse - P-value
Had surgery 123 (70%) 10 (20%) 0.00000 11 (73%) 4 (25%) 0.00811 112 (70%) 6 (18%) 0.00000
Lawton-Young grade 5.7 (1.5) 4.0 (1.4) 0.00000 7.1 (1.0) 6.3 (1.2) 0.09077 5.1 (1.2) 3.8 (1.2) 0.00000
Spetzler-Martin grade 3.8 (1.1) 2.6 (1.2) 0.00000 4.6 (0.6) 4.1 (1.0) 0.13386 3.4 (1.0) 2.5 (1.1) 0.00002
Nidus size (cm) 5.1 (3.0) 3.0 (1.8) 0.00000 7.4 (3.4) 5.3 (2.2) 0.09128 4.0 (2.0) 2.8 (1.7) 0.00059
Spetzler-Martin grade < 4 137 (77%) 21 (42%) 0.00000 4 (27%) 1 (6%) 0.12866 133 (82%) 20 (59%) 0.00294
Size score 2.1 (0.8) 1.6 (0.7) 0.00000 2.6 (0.6) 2.3 (0.7) 0.22405 1.9 (0.8) 1.5 (0.6) 0.00127
Involves eloquent location 94 (53%) 44 (88%) 0.00001 13 (87%) 16 (100%) 0.13739 81 (50%) 28 (82%) 0.00057
Is diffuse nidus 15 (8%) 16 (32%) 0.00002 15 (100%) 16 (100%) 0 (0%) 0 (0%)
mRS (final) 1.8 (1.8) 0.8 (0.9) 0.00014 2.6 (1.8) 1.0 (0.9) 0.00762 1.5 (1.7) 0.8 (0.9) 0.05093
Hemorrhage at presentation 117 (66%) 20 (40%) 0.00089 10 (67%) 6 (38%) 0.11015 107 (66%) 14 (41%) 0.00681
Deficit at presentation 58 (33%) 26 (52%) 0.01308 7 (47%) 11 (69%) 0.22059 51 (31%) 15 (44%) 0.15742
Has deep venous drainage 90 (51%) 35 (70%) 0.01645 13 (87%) 14 (88%) 0.94575 77 (48%) 21 (62%) 0.13226
mRS (presentation) 1.8 (1.6) 2.5 (1.8) 0.01646 2.0 (1.6) 2.0 (1.6) 0.98320 1.7 (1.7) 2.5 (1.8) 0.01689
Presence of aneurysms 32 (18%) 13 (27%) 0.16753 4 (27%) 6 (38%) 0.52586 28 (17%) 7 (22%) 0.53817
Paresis at presentation 42 (24%) 15 (30%) 0.36760 5 (33%) 7 (44%) 0.55830 37 (23%) 8 (24%) 0.93088
mRS (pre-treatment) 1.5 (1.4) 1.6 (1.5) 0.51007 1.9 (1.6) 1.7 (1.3) 0.96677 1.3 (1.3) 1.6 (1.5) 0.27834
Sex (Male) 92 (52%) 28 (57%) 0.52231 6 (40%) 10 (62%) 0.21781 86 (53%) 18 (55%) 0.87861
Seizures at presentation 42 (24%) 14 (28%) 0.53705 6 (40%) 5 (31%) 0.61668 36 (22%) 9 (26%) 0.59326
Age at 1st treatment (years) 11.2 (3.7) 11.3 (3.9) 0.76249 10.2 (3.6) 10.9 (3.9) 0.63239 11.7 (3.7) 11.4 (3.9) 0.64510
mRS (1-week post-op) 1.8 (1.5) 1.6 (1.3) 0.80914 2.2 (1.6) 1.6 (1.1) 0.31152 1.5 (1.4) 1.6 (1.3) 0.52785

By SM grade high vs. low

# Define arguments
interaction_col <- "is_spetzler_martin_grade_less_than_4"
interaction_col_name <- "Spetzler-Martin grade < 4"
prefix_true <- "Low grade"
prefix_false <- "High grade"

# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = FALSE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = FALSE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = "All"
)
out_all <- bind_rows(out)

# Get low grade (SM < 4)
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = TRUE,
  prefix = prefix_true
)
out_low_grade <- bind_rows(out)

# Get high grade (SM >= 4)
out <- list()
out[[1]] <- compute_descriptive_continuous(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = FALSE,
  prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
  use_interaction = TRUE,
  interaction_col = interaction_col,
  interaction_col_name = interaction_col_name,
  interaction_col_value = FALSE,
  prefix = prefix_false
)
out_high_grade <- bind_rows(out)

# Synthesize
out_complete <-
  out_all %>%
  left_join(out_low_grade, by = "keys") %>%
  left_join(out_high_grade, by = "keys") %>%
  arrange(`All - P-value`)

# Print
out_complete %>%
  sable()
keys All - Obliterated (n=178) All - Not obliterated (n=50) All - P-value Low grade - Obliterated (n=138) Low grade - Not obliterated (n=21) Low grade - P-value High grade - Obliterated (n=40) High grade - Not obliterated (n=29) High grade - P-value
Had surgery 123 (70%) 10 (20%) 0.00000 99 (73%) 4 (19%) 0.00000 24 (60%) 6 (21%) 0.00125
Lawton-Young grade 5.7 (1.5) 4.0 (1.4) 0.00000 4.3 (0.7) 3.5 (1.0) 0.00130 6.7 (0.9) 5.9 (0.9) 0.00190
Spetzler-Martin grade 3.8 (1.1) 2.6 (1.2) 0.00000 2.7 (0.6) 2.1 (0.8) 0.00148 4.6 (0.5) 4.2 (0.4) 0.00579
Nidus size (cm) 5.1 (3.0) 3.0 (1.8) 0.00000 3.1 (1.8) 2.4 (1.1) 0.04734 6.5 (2.9) 5.2 (2.1) 0.02477
Spetzler-Martin grade < 4 138 (78%) 21 (42%) 0.00000 138 (100%) 21 (100%) 0 (0%) 0 (0%)
Size score 2.1 (0.8) 1.6 (0.7) 0.00000 1.5 (0.6) 1.3 (0.5) 0.18362 2.6 (0.5) 2.4 (0.5) 0.11269
Involves eloquent location 95 (53%) 44 (88%) 0.00001 58 (42%) 15 (71%) 0.01205 37 (92%) 29 (100%) 0.13440
Diffuse nidus 15 (8%) 16 (32%) 0.00002 4 (3%) 1 (5%) 0.65442 11 (28%) 15 (52%) 0.04188
mRS (final) 1.8 (1.8) 0.8 (0.9) 0.00012 1.5 (1.7) 0.7 (0.8) 0.05258 2.1 (1.9) 1.1 (1.1) 0.02674
Hemorrhage at presentation 117 (66%) 20 (40%) 0.00105 95 (69%) 11 (52%) 0.13729 22 (55%) 9 (31%) 0.04987
Has deep venous drainage 90 (51%) 35 (70%) 0.01489 58 (42%) 10 (48%) 0.63062 32 (80%) 25 (86%) 0.50506
Deficit at presentation 59 (33%) 26 (52%) 0.01507 45 (33%) 7 (33%) 0.94759 14 (35%) 19 (66%) 0.01289
mRS (presentation) 1.8 (1.6) 2.4 (1.8) 0.01602 1.8 (1.8) 2.6 (1.9) 0.05172 1.8 (1.6) 2.0 (1.5) 0.38747
Presence of aneurysms 33 (19%) 13 (27%) 0.19296 23 (17%) 6 (32%) 0.11749 10 (25%) 7 (24%) 0.93510
Paresis at presentation 43 (24%) 15 (30%) 0.40296 33 (24%) 5 (24%) 0.99176 10 (25%) 10 (34%) 0.39490
mRS (pre-treatment) 1.5 (1.4) 1.6 (1.5) 0.49277 1.4 (1.5) 1.6 (1.5) 0.38951 1.6 (1.3) 1.6 (1.2) 0.64857
Sex (Male) 92 (52%) 28 (57%) 0.49892 74 (54%) 9 (45%) 0.47186 18 (45%) 19 (66%) 0.09400
Seizures at presentation 42 (24%) 14 (28%) 0.52355 31 (22%) 5 (24%) 0.89115 11 (28%) 9 (31%) 0.75117
Age at 1st treatment (years) 11.2 (3.7) 11.3 (3.9) 0.80199 12.4 (2.7) 11.3 (3.7) 0.24215 10.3 (4.1) 11.4 (4.4) 0.29417
mRS (1-week post-op) 1.8 (1.5) 1.6 (1.3) 0.82396 1.5 (1.5) 1.6 (1.3) 0.66679 1.9 (1.5) 1.8 (1.2) 0.98990

Associations

Correlation matrix.

# Cols
cols <- c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  OUTCOMES
)

# Remove hemorrhage
if (str_detect(params$TITLE, "Hemorrhage")) {
  cols <- cols[cols != "has_hemorrhage"]
}
  
# Create new dataframe
df_ <-
  df_uni %>%
  select(all_of(cols))

# Convert the outcome variable to numeric for correlation calculation
df_ <-
  df_ %>%
  select(where(is.numeric), where(is.logical)) %>%
  mutate(across(everything(), as.numeric))

# Compute the correlation matrix
cor_matrix <- cor(df_, use = "complete.obs", method = "spearman")

# Plot the correlation matrix
ggcorrplot::ggcorrplot(
  cor_matrix,
  method = "circle",
  lab = TRUE,
  lab_size = 2,
  colors = c("red", "white", "green4"),
  title = "Correlation Matrix",
  hc.order = TRUE
) +
  theme(
    axis.text.x = element_text(size = 8), # Adjust x-axis text size
    axis.text.y = element_text(size = 8) # Adjust y-axis text size
  )
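The correlation matrix flags pairwise collinearity, but multivariable redundancy is better captured by variance inflation factors (VIF = 1 / (1 - R²ⱼ), where R²ⱼ comes from regressing predictor j on the others). A base-R sketch on simulated predictors (car::vif() does the same from a fitted model; the data here are invented):

```r
# Variance inflation factors computed from first principles:
# regress each column on all the others and invert 1 - R^2.
vif_base <- function(df) {
  sapply(names(df), function(v) {
    fml <- reformulate(setdiff(names(df), v), response = v)
    1 / (1 - summary(lm(fml, data = df))$r.squared)
  })
}

set.seed(1891)
pred <- data.frame(x1 = rnorm(100))
pred$x2 <- pred$x1 + rnorm(100, sd = 0.1)  # nearly collinear with x1
pred$x3 <- rnorm(100)
round(vif_base(pred), 1)  # x1 and x2 heavily inflated, x3 near 1
```

For this dataset, the graded scores (Spetzler-Martin, Lawton-Young, size score, nidus size) are constructed from overlapping components, so high VIFs among them are expected and argue for keeping only one in a multivariable model.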


Univariable statistics

Setup

Define function.

fit_model <- function(
    df = df_uni, cols, is_sandwich = TRUE) {
  # Initialize
  out <- list()

  for (i in seq_along(cols)) {
    # Create formula
    model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))

    fit <- suppressWarnings(glm(
      model,
      data = df,
      family = binomial()
    ))

    if (is_sandwich) {
      # Calculate robust standard errors (sandwich)
      robust_se <- sandwich::vcovHC(fit, type = "HC0")
      # Compute robust confidence intervals
      fit_results <- lmtest::coeftest(fit, vcov. = robust_se)
    } else {
      fit_results <- fit
    }

    # Summarize model coefficients
    fit_summary <-
      fit_results %>%
      # broom does not support exponentiation after coeftest so do it manually
      broom::tidy(conf.int = T) %>%
      mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
      arrange(term == "(Intercept)", p.value) %>%
      rename(odds_ratio = estimate, z_value = statistic)

    # Stylize and print
    stylized <-
      fit_summary %>%
      rename(
        "Predictors" = term,
        "Odds Ratios (OR)" = odds_ratio,
        "SE" = std.error,
        "Z-scores" = z_value,
        "P-values" = p.value,
        "CI (low)" = conf.low,
        "CI (high)" = conf.high,
      ) %>%
      mutate(
        `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
        `SE` = round(`SE`, 2),
        `CI (low)` = round(`CI (low)`, 2),
        `CI (high)` = round(`CI (high)`, 2),
        `P-values` = round(`P-values`, 3),
      )

    out[[i]] <- stylized
  }
  return(out)
}
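As a self-contained illustration of the sandwich step inside `fit_model()` (simulated data; all names here are hypothetical), model-based and HC0 robust standard errors can be compared directly:

```r
library(sandwich)
library(lmtest)

set.seed(1)
# Simulated logistic data: one predictor, correctly specified model
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
fit <- glm(y ~ x, family = binomial())

# Model-based vs. HC0 (sandwich) standard errors
se_model <- sqrt(diag(vcov(fit)))
se_robust <- sqrt(diag(sandwich::vcovHC(fit, type = "HC0")))

# Robust Wald tests, as used in fit_model()
lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC0"))
```

Under correct specification the two sets of standard errors should be close; a large discrepancy is itself a model diagnostic.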

Unadjusted - Original

Fit logistic.

# Define cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Fit model
out <- fit_model(df = df_uni, cols = cols, is_sandwich = T)

# Create table
univariable_unadjusted <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_unadjusted %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_final 0.54 0.12 -4.98381 0.000 0.42 0.69
max_size_cm 0.69 0.08 -4.63358 0.000 0.58 0.80
spetzler_martin_grade 0.40 0.16 -5.74497 0.000 0.29 0.55
lawton_young_grade 0.47 0.12 -6.18704 0.000 0.37 0.60
size_score 0.33 0.23 -4.66773 0.000 0.21 0.53
is_eloquent_locationTRUE 0.16 0.46 -4.03430 0.000 0.06 0.38
is_diffuse_nidusTRUE 0.20 0.41 -4.00540 0.000 0.09 0.44
is_spetzler_martin_grade_less_than_4TRUE 4.76 0.34 4.61668 0.000 2.46 9.24
is_surgeryTRUE 9.28 0.39 5.71522 0.000 4.32 19.93
venous_drainageSuperficial 4.39 0.42 3.48242 0.000 1.91 10.08
has_hemorrhageTRUE 2.88 0.33 3.21157 0.001 1.51 5.48
locationDeep 0.07 1.05 -2.52681 0.012 0.01 0.55
has_deep_venous_drainageTRUE 0.44 0.34 -2.40409 0.016 0.22 0.86
has_deficitTRUE 0.46 0.32 -2.40666 0.016 0.24 0.86
venous_drainageDeep 2.71 0.42 2.39697 0.017 1.20 6.12
modified_rankin_score_presentation 1.25 0.10 2.22691 0.026 1.03 1.52
locationOther 0.05 1.75 -1.74397 0.081 0.00 1.46
locationHemispheric 0.25 1.05 -1.32999 0.184 0.03 1.94
procedure_combinationsERS 3.75 1.00 1.31628 0.188 0.52 26.84
has_associated_aneurysmTRUE 0.61 0.38 -1.29670 0.195 0.29 1.28
has_paresisTRUE 0.74 0.35 -0.83640 0.403 0.37 1.49
procedure_combinationsR 2.17 0.96 0.80291 0.422 0.33 14.31
procedure_combinationsER 1.97 0.96 0.70656 0.480 0.30 13.01
is_maleTRUE 0.80 0.33 -0.67700 0.498 0.42 1.52
modified_rankin_score_pretreatment 1.08 0.12 0.67581 0.499 0.86 1.36
has_seizuresTRUE 0.79 0.36 -0.63845 0.523 0.39 1.61
modified_rankin_score_postop_within_1_week 0.93 0.13 -0.54462 0.586 0.73 1.19
age_at_first_treatment_yrs 1.01 0.04 0.18340 0.854 0.93 1.09
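With roughly 28 predictors tested one at a time, a multiplicity adjustment is worth reporting next to the raw p-values. A sketch using Benjamini-Hochberg on a subset of the p-values from the table above (in the full analysis, the same call would be applied to the `P-values` column of `univariable_unadjusted`):

```r
# Raw p-values (a subset of the column above, already sorted ascending)
p_raw <- c(0.000, 0.001, 0.012, 0.016, 0.026, 0.184, 0.403, 0.854)

# Benjamini-Hochberg adjusted p-values (controls the false discovery rate)
p_bh <- p.adjust(p_raw, method = "BH")
round(p_bh, 3)
```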

Adjusted - Original

Fit logistic.

cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_uni

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i]
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_adjusted <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_adjusted %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_final 0.51 0.13 -5.00299 0.000 0.40 0.67
max_size_cm 0.68 0.08 -4.76743 0.000 0.58 0.79
spetzler_martin_grade 0.38 0.16 -5.89898 0.000 0.28 0.53
lawton_young_grade 0.47 0.12 -6.21220 0.000 0.37 0.59
size_score 0.32 0.24 -4.79879 0.000 0.20 0.51
is_eloquent_locationTRUE 0.15 0.47 -3.95835 0.000 0.06 0.39
is_diffuse_nidusTRUE 0.19 0.41 -4.10478 0.000 0.08 0.42
is_spetzler_martin_grade_less_than_4TRUE 5.02 0.34 4.73994 0.000 2.58 9.78
is_surgeryTRUE 9.26 0.40 5.57769 0.000 4.23 20.23
venous_drainageSuperficial 4.80 0.43 3.61418 0.000 2.05 11.24
has_hemorrhageTRUE 2.84 0.34 3.10071 0.002 1.47 5.49
locationDeep 0.06 1.05 -2.60427 0.009 0.01 0.51
has_deficitTRUE 0.44 0.32 -2.59152 0.010 0.23 0.82
has_deep_venous_drainageTRUE 0.41 0.36 -2.52344 0.012 0.20 0.82
venous_drainageDeep 2.79 0.42 2.46488 0.014 1.23 6.30
modified_rankin_score_presentation 1.24 0.10 2.18286 0.029 1.02 1.51
has_associated_aneurysmTRUE 0.58 0.38 -1.41461 0.157 0.28 1.23
locationHemispheric 0.23 1.05 -1.40092 0.161 0.03 1.80
procedure_combinationsERS 3.44 1.06 1.16326 0.245 0.43 27.54
has_paresisTRUE 0.72 0.36 -0.93177 0.351 0.36 1.44
procedure_combinationsR 2.19 1.03 0.76328 0.445 0.29 16.36
modified_rankin_score_postop_within_1_week 0.91 0.12 -0.72063 0.471 0.72 1.17
has_seizuresTRUE 0.78 0.36 -0.69840 0.485 0.38 1.58
modified_rankin_score_pretreatment 1.07 0.12 0.54239 0.588 0.85 1.34
procedure_combinationsER 1.72 1.02 0.53156 0.595 0.23 12.85

Unadjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Fit model on the recoded dataset, reusing the helper defined above
out <- fit_model(df = df_multi, cols = cols, is_sandwich = T)

Print results.

# Create table
univariable_unadjusted_recoded <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_unadjusted_recoded %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_final 0.54 0.12 -4.98381 0.000 0.42 0.69
max_size_cm 0.69 0.08 -4.63358 0.000 0.58 0.80
spetzler_martin_grade 0.40 0.16 -5.74497 0.000 0.29 0.55
lawton_young_grade 0.47 0.12 -6.18704 0.000 0.37 0.60
size_score 0.33 0.23 -4.66773 0.000 0.21 0.53
is_eloquent_locationTRUE 0.16 0.46 -4.03430 0.000 0.06 0.38
is_diffuse_nidusTRUE 0.20 0.41 -4.00540 0.000 0.09 0.44
is_spetzler_martin_grade_less_than_4TRUE 4.76 0.34 4.61668 0.000 2.46 9.24
has_hemorrhageTRUE 2.88 0.33 3.21157 0.001 1.51 5.48
locationDeep 0.09 0.77 -3.05605 0.002 0.02 0.43
venous_drainageSuperficial 2.33 0.34 2.46962 0.014 1.19 4.57
has_deep_venous_drainageTRUE 0.44 0.34 -2.40409 0.016 0.22 0.86
has_deficitTRUE 0.46 0.32 -2.40666 0.016 0.24 0.86
modified_rankin_score_presentation 1.25 0.10 2.22691 0.026 1.03 1.52
procedure_combinationsMultimodal 5.02 0.94 1.72117 0.085 0.80 31.49
locationHemispheric 0.33 0.77 -1.42543 0.154 0.07 1.51
has_associated_aneurysmTRUE 0.61 0.38 -1.29670 0.195 0.29 1.28
has_paresisTRUE 0.74 0.35 -0.83640 0.403 0.37 1.49
procedure_combinationsR 2.17 0.96 0.80291 0.422 0.33 14.31
is_maleTRUE 0.80 0.33 -0.67700 0.498 0.42 1.52
modified_rankin_score_pretreatment 1.08 0.12 0.67581 0.499 0.86 1.36
has_seizuresTRUE 0.79 0.36 -0.63845 0.523 0.39 1.61
modified_rankin_score_postop_within_1_week 0.93 0.13 -0.54462 0.586 0.73 1.19
age_at_first_treatment_yrs 1.01 0.04 0.18340 0.854 0.93 1.09

Adjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Initialize
out <- list()
df <- df_multi

for (i in seq_along(cols)) {
  # Create formula
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~",
    "age_at_first_treatment_yrs + is_male +",
    cols[i]
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[i]] <- stylized
}

Print results.

# Create table
univariable_adjusted_recoded <-
  out %>%
  bind_rows() %>%
  filter(Predictors != "(Intercept)") %>%
  filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
  filter(`CI (high)` < 50) %>%
  arrange(`P-values`)

# Print
univariable_adjusted_recoded %>%
  sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
modified_rankin_score_final 0.51 0.13 -5.00299 0.000 0.40 0.67
max_size_cm 0.68 0.08 -4.76743 0.000 0.58 0.79
spetzler_martin_grade 0.38 0.16 -5.89898 0.000 0.28 0.53
lawton_young_grade 0.47 0.12 -6.21220 0.000 0.37 0.59
size_score 0.32 0.24 -4.79879 0.000 0.20 0.51
is_eloquent_locationTRUE 0.15 0.47 -3.95835 0.000 0.06 0.39
is_diffuse_nidusTRUE 0.19 0.41 -4.10478 0.000 0.08 0.42
is_spetzler_martin_grade_less_than_4TRUE 5.02 0.34 4.73994 0.000 2.58 9.78
has_hemorrhageTRUE 2.84 0.34 3.10071 0.002 1.47 5.49
locationDeep 0.04 1.05 -2.98542 0.003 0.01 0.34
has_deficitTRUE 0.44 0.32 -2.59152 0.010 0.23 0.82
venous_drainageSuperficial 2.51 0.35 2.59197 0.010 1.25 5.03
has_deep_venous_drainageTRUE 0.41 0.36 -2.52344 0.012 0.20 0.82
modified_rankin_score_presentation 1.24 0.10 2.18286 0.029 1.02 1.51
locationHemispheric 0.16 1.05 -1.77453 0.076 0.02 1.21
procedure_combinationsMultimodal 4.60 0.98 1.56510 0.118 0.68 31.14
has_associated_aneurysmTRUE 0.58 0.38 -1.41461 0.157 0.28 1.23
has_paresisTRUE 0.72 0.36 -0.93177 0.351 0.36 1.44
procedure_combinationsR 2.17 1.00 0.77563 0.438 0.31 15.48
modified_rankin_score_postop_within_1_week 0.91 0.12 -0.72063 0.471 0.72 1.17
has_seizuresTRUE 0.78 0.36 -0.69840 0.485 0.38 1.58
modified_rankin_score_pretreatment 1.07 0.12 0.54239 0.588 0.85 1.34

Interaction analysis

Adjusted - Recoded

Fit logistic.

# Define columns
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL
))

# Remove unwanted columns
unwanted <- c("modified_rankin_score_final")
cols <- cols[!cols %in% unwanted]

adjustors <- c(
  "age_at_first_treatment_yrs",
  "is_male"
)

# Initialize
out <- list()
df <- df_multi
k <- 1

for (i in seq_along(cols)) {
  # Do not fit model for variables that we are adjusting for
  if (cols[i] %in% adjustors) {
    next
  }

  # Create formula: main effects plus an interaction with hemorrhage
  model <- as.formula(paste(
    CHOSEN_OUTCOME,
    "~ age_at_first_treatment_yrs + is_male +",
    cols[i],
    "* has_hemorrhage"
  ))

  fit <- suppressWarnings(glm(
    model,
    data = df,
    family = binomial()
  ))

  # Calculate robust standard errors (sandwich)
  robust_se <- sandwich::vcovHC(fit, type = "HC0")
  # Compute robust confidence intervals
  robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

  # Summarize model coefficients
  fit_summary <-
    robust_ci %>%
    # broom does not support exponentiation after coeftest so do it manually
    broom::tidy(conf.int = T) %>%
    mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
    arrange(term == "(Intercept)", p.value) %>%
    rename(odds_ratio = estimate, z_value = statistic)

  # Stylize and print
  stylized <-
    fit_summary %>%
    rename(
      "Predictors" = term,
      "Odds Ratios (OR)" = odds_ratio,
      "SE" = std.error,
      "Z-scores" = z_value,
      "P-values" = p.value,
      "CI (low)" = conf.low,
      "CI (high)" = conf.high,
    ) %>%
    mutate(
      `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
      `SE` = round(`SE`, 2),
      `CI (low)` = round(`CI (low)`, 2),
      `CI (high)` = round(`CI (high)`, 2),
      `P-values` = round(`P-values`, 3),
    )

  out[[k]] <- stylized
  k <- k + 1
}

Print all results.

# Define values
unwanted <- c(
  "is_maleTRUE",
  "age_at_first_treatment_yrs",
  "(Intercept)",
  "has_hemorrhageTRUE"
)

# Initialize objects
isolated_values <- list()

# Get main and interaction effects
for (i in seq_along(out)) {
  # Get data
  result <- out[[i]]

  # Get all main effects of interest
  main_effects <-
    result %>%
    filter(!Predictors %in% unwanted) %>%
    filter(!str_detect(Predictors, ":")) %>%
    select(-"SE", -"Z-scores")

  # Get the interaction effects
  interaction_effects <-
    result %>%
    filter(!Predictors %in% unwanted) %>%
    filter(str_detect(Predictors, ":")) %>%
    mutate(Predictors = str_extract(Predictors, "^[^:]+")) %>%
    select(-"SE", -"Z-scores") %>%
    rename_with(~ paste("Interaction -", .), -Predictors)

  isolated_values[[i]] <-
    main_effects %>%
    left_join(interaction_effects, by = "Predictors")
}

Create and print the table.

# Create table
univariable_adjusted_recoded_interactions <-
  isolated_values %>%
  bind_rows() %>%
  arrange(`Interaction - P-values`)

# Print
univariable_adjusted_recoded_interactions %>%
  sable()
Predictors Odds Ratios (OR) P-values CI (low) CI (high) Interaction - Odds Ratios (OR) Interaction - P-values Interaction - CI (low) Interaction - CI (high)
is_eloquent_locationTRUE 4.8000e-01 0.168 1.7000e-01 1.3600e+00 0.00 0.000 0.00 0.00
locationDeep 2.2000e-01 0.189 2.0000e-02 2.1100e+00 0.00 0.000 0.00 0.00
locationHemispheric 5.5000e-01 0.599 6.0000e-02 5.1400e+00 0.00 0.000 0.00 0.00
has_seizuresTRUE 1.5300e+00 0.402 5.7000e-01 4.1200e+00 0.29 0.089 0.07 1.21
is_surgeryTRUE 6.6434e+07 0.000 2.8505e+07 1.5483e+08 0.52 0.207 0.19 1.44
size_score 2.5000e-01 0.000 1.3000e-01 5.1000e-01 1.88 0.222 0.68 5.18
has_paresisTRUE 9.8000e-01 0.974 3.0000e-01 3.1600e+00 0.45 0.332 0.09 2.23
max_size_cm 6.4000e-01 0.000 5.0000e-01 8.2000e-01 1.17 0.355 0.84 1.63
venous_drainageSuperficial 3.8200e+00 0.006 1.4800e+00 9.8700e+00 0.61 0.502 0.15 2.57
has_deep_venous_drainageTRUE 2.8000e-01 0.009 1.1000e-01 7.2000e-01 1.53 0.565 0.36 6.43
is_spetzler_martin_grade_less_than_4TRUE 5.1700e+00 0.001 1.9700e+00 1.3560e+01 0.70 0.626 0.17 2.93
modified_rankin_score_postop_within_1_week 8.8000e-01 0.476 6.1000e-01 1.2600e+00 0.89 0.652 0.53 1.48
modified_rankin_score_presentation 9.9000e-01 0.976 6.5000e-01 1.5300e+00 1.11 0.703 0.66 1.85
has_associated_aneurysmTRUE 3.9000e-01 0.134 1.1000e-01 1.3400e+00 1.36 0.706 0.27 6.82
is_diffuse_nidusTRUE 1.7000e-01 0.004 5.0000e-02 5.8000e-01 1.24 0.800 0.23 6.63
lawton_young_grade 4.6000e-01 0.000 3.1000e-01 6.9000e-01 1.05 0.855 0.62 1.78
has_deficitTRUE 4.1000e-01 0.070 1.5000e-01 1.0800e+00 1.06 0.935 0.24 4.69
modified_rankin_score_pretreatment 9.1000e-01 0.706 5.6000e-01 1.4700e+00 1.01 0.970 0.58 1.77
spetzler_martin_grade 4.1000e-01 0.000 2.5000e-01 6.5000e-01 1.00 0.994 0.53 1.91
procedure_combinationsMultimodal 2.3406e+08 0.00
procedure_combinationsR 1.2091e+08 0.00
procedure_combinationsS 1.2031e+16 0.00

Eloquence.

df %>%
  count(has_hemorrhage, is_eloquent_location, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, is_eloquent_location) %>%
  group_by(has_hemorrhage, is_eloquent_location) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage is_eloquent_location is_obliterated n num_total pct_total
FALSE FALSE FALSE 6 28 21%
FALSE FALSE TRUE 22 28 79%
FALSE TRUE FALSE 24 63 38%
FALSE TRUE TRUE 39 63 62%
TRUE FALSE TRUE 61 61 100%
TRUE TRUE FALSE 20 76 26%
TRUE TRUE TRUE 56 76 74%
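The 100% cell above (61/61 obliterated among hemorrhage-positive, non-eloquent AVMs) is complete separation, which is why some interaction rows in the table report odds ratios and confidence limits of 0.00 or drop out entirely: the maximum-likelihood estimate does not exist. A self-contained sketch of the symptoms on synthetic, perfectly separated data:

```r
# Perfectly separated synthetic data: sign(x) determines y exactly
x <- c(-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)
fit_sep <- suppressWarnings(glm(y ~ x, family = binomial()))

# Symptoms: fitted probabilities pinned at 0/1 and a huge slope and SE
range(fitted(fit_sep))
summary(fit_sep)$coefficients["x", c("Estimate", "Std. Error")]
```

Penalized (Firth) logistic regression, e.g. via the `logistf` package, is one standard remedy; reporting such cells descriptively, as done here, is another.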

Deficit.

df %>%
  count(has_hemorrhage, has_deficit, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, has_deficit) %>%
  group_by(has_hemorrhage, has_deficit) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage has_deficit is_obliterated n num_total pct_total
FALSE FALSE FALSE 15 57 26%
FALSE FALSE TRUE 42 57 74%
FALSE TRUE FALSE 15 34 44%
FALSE TRUE TRUE 19 34 56%
TRUE FALSE FALSE 9 86 10%
TRUE FALSE TRUE 77 86 90%
TRUE TRUE FALSE 11 51 22%
TRUE TRUE TRUE 40 51 78%

Seizures.

df %>%
  count(has_hemorrhage, has_seizures, !!sym(CHOSEN_OUTCOME)) %>%
  drop_na() %>%
  arrange(has_hemorrhage, has_seizures) %>%
  group_by(has_hemorrhage, has_seizures) %>%
  mutate(
    num_total = sum(n),
    pct_total = scales::percent(n / num_total, 1)
  ) %>%
  sable()
has_hemorrhage has_seizures is_obliterated n num_total pct_total
FALSE FALSE FALSE 23 64 36%
FALSE FALSE TRUE 41 64 64%
FALSE TRUE FALSE 7 27 26%
FALSE TRUE TRUE 20 27 74%
TRUE FALSE FALSE 13 108 12%
TRUE FALSE TRUE 95 108 88%
TRUE TRUE FALSE 7 29 24%
TRUE TRUE TRUE 22 29 76%

Selective inference

Read the following guides:

Setup

Define unwanted columns.

unwanted_all <- c(
  # Use values at or before first treatment
  "modified_rankin_score_postop_within_1_week",
  "modified_rankin_score_final",
  # Use the split between high vs. low grade instead of the granular
  "spetzler_martin_grade"
)

# Remove hemorrhage when it is the outcome under study
if (str_detect(params$TITLE, "Hemorrhage")) {
  unwanted_all <- c(unwanted_all, "has_hemorrhage")
}

unwanted_without_gradings <- c(
  unwanted_all,
  # Spetzler-Martin grade (size score + eloquent location + venous drainage)
  "is_spetzler_martin_grade_less_than_4",
  "size_score", # Already covered by the more detailed max_size_cm
  "venous_drainage", # Already covered by has_deep_venous_drainage
  "location", # Already covered to some extent by eloquence
  # Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
  # (SM is heavily overlapping with LY, so include its components)
  "lawton_young_grade",
  # Use mRS pre-treatment (more predictive)
  "modified_rankin_score_presentation" # "modified_rankin_score_pretreatment"
)

# Define cols of interest - use grading scores whenever these exist
unwanted_with_gradings <- c(
  # Spetzler-Martin grade (size score + eloquent location + venous drainage)
  "is_spetzler_martin_grade_less_than_4", # The grade has lower p-value
  "size_score",
  "max_size_cm",
  "is_eloquent_location",
  # "location",  # Include this as not highly correlated with SM
  "has_deep_venous_drainage",
  "venous_drainage",
  # Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
  # (SM is heavily overlapping with LY, so include its components)
  "lawton_young_grade",
  # "is_diffuse_nidus",
  # "has_hemorrhage",
  # Use values at or before first treatment
  "modified_rankin_score_final",
  "modified_rankin_score_postop_within_1_week",
  # Use mRS pre-treatment (more predictive)
  "modified_rankin_score_presentation"
  # "modified_rankin_score_pretreatment"
)

Create dataset.

# Define cols
cols <- unname(c(
  EXPOSURES_CONTINUOUS,
  EXPOSURES_BINARY,
  EXPOSURES_CATEGORICAL,
  CHOSEN_OUTCOME
))

# Define column sets for each analysis
cols_all <- cols[!cols %in% unwanted_all]
cols_with_gradings <- cols[!cols %in% unwanted_with_gradings]
cols_without_gradings <- cols[!cols %in% unwanted_without_gradings]

# Create df of interest
df_all <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_all)) %>%
  drop_na()

df_with_gradings <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_with_gradings)) %>%
  drop_na()

df_no_scores <-
  # Use the non-recoded dataset
  df_uni %>%
  dplyr::select(all_of(cols_without_gradings)) %>%
  drop_na()

# Create formula
frla <- as.formula(paste(CHOSEN_OUTCOME, " ~ . - 1"))

# Create matrices with all variables
X_all <- model.matrix(frla, df_all)
y_all <- df_all %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Create matrices with scores
X_with_gradings <- model.matrix(frla, df_with_gradings)
y_with_gradings <- df_with_gradings %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Create matrices with score components
X_without_gradings <- model.matrix(frla, df_no_scores)
y_without_gradings <- df_no_scores %>%
  pull(CHOSEN_OUTCOME) %>%
  as.numeric()

# Center X for the selective-inference steps
# (scale = FALSE: glmnet standardizes internally via standardize = TRUE,
# and fs() is called with normalize = FALSE on the centered matrix)
X_all_scaled <- scale(X_all, center = T, scale = F)
X_with_gradings_scaled <- scale(X_with_gradings, center = T, scale = F)
X_without_gradings_scaled <- scale(X_without_gradings, center = T, scale = F)

# See the names of X to match column numbers to column names later on
colnames(X_without_gradings_scaled)
 [1] "age_at_first_treatment_yrs"         "modified_rankin_score_pretreatment"
 [3] "max_size_cm"                        "is_maleFALSE"                      
 [5] "is_maleTRUE"                        "is_eloquent_locationTRUE"          
 [7] "has_deep_venous_drainageTRUE"       "is_diffuse_nidusTRUE"              
 [9] "has_hemorrhageTRUE"                 "has_seizuresTRUE"                  
[11] "has_deficitTRUE"                    "has_paresisTRUE"                   
[13] "has_associated_aneurysmTRUE"        "is_surgeryTRUE"                    
[15] "procedure_combinationsER"           "procedure_combinationsERS"         
[17] "procedure_combinationsES"           "procedure_combinationsR"           
[19] "procedure_combinationsRS"           "procedure_combinationsS"           
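Note that the intercept-free formula keeps both levels of the first factor (`is_maleFALSE` and `is_maleTRUE` above); after centering, those two columns are exactly collinear, which is what produces the absurd coefficients (e.g. Var 4) late in the stepwise output below. A self-contained sketch of detecting and dropping rank-deficient columns via pivoted QR (synthetic data; variable names are hypothetical):

```r
# An intercept-free model matrix keeps both levels of the first factor,
# so after centering the matrix is rank deficient
d <- data.frame(
  y = c(0, 1, 1, 0, 1, 0),
  sex = factor(c("F", "M", "M", "F", "M", "F")),
  size = c(2.1, 3.4, 1.8, 2.9, 4.0, 2.5)
)
X <- model.matrix(y ~ . - 1, d) # columns: sexF, sexM, size
Xc <- scale(X, center = TRUE, scale = FALSE)

# sexF and sexM sum to 1, so their centered versions are exactly collinear;
# keep a full-rank subset of columns using the pivoted QR decomposition
qrX <- qr(Xc)
X_full_rank <- Xc[, qrX$pivot[seq_len(qrX$rank)], drop = FALSE]
```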

Stepwise

Run forward stepwise selection with selectiveInference::fs() and compute post-selection p-values with fsInf(). Note that fs() fits a Gaussian model, so the binary outcome enters as a 0/1 numeric response.

# Set seed
set.seed(33)

# Run forward step-wise
fsfit <- fs(
  x = X_without_gradings_scaled,
  y = y_without_gradings,
  maxsteps = 2000,
  intercept = TRUE,
  normalize = FALSE
)

# Estimate sigma
sigmahat <- estimateSigma(
  x = X_without_gradings_scaled,
  y = y_without_gradings,
  intercept = TRUE,
  standardize = FALSE
)$sigmahat
# Compute sequential p-values and confidence intervals - for all
# (sigma estimated from full model)
out.seq <- fsInf(fsfit, type = "active", sigma = sigmahat)
out.seq

Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "active")

Standard deviation of noise (specified or estimated) sigma = 0.348

Sequential testing results with alpha = 0.100
 Step Var        Coef Z-score P-value   LowConfPt    UpConfPt LowTailArea UpTailArea
    1  14  3.4100e-01   7.153   0.052 -5.0000e-03  4.0700e-01       0.050      0.049
    2   3 -5.9000e-02  -5.794   0.000 -1.4500e-01 -4.4000e-02       0.050      0.050
    3  11 -1.2400e-01  -2.546   0.194 -2.0300e-01  1.1200e-01       0.049      0.049
    4   8 -1.3200e-01  -1.683   0.296 -5.5300e-01  3.0700e-01       0.050      0.050
    5   5 -5.2000e-02  -1.101   1.000  4.6830e+00         Inf       0.000      0.000
    6   9  5.1000e-02   1.029   1.000        -Inf -5.0030e+00       0.000        NaN
    7   6 -5.4000e-02  -0.986   1.000  5.4860e+00         Inf         NaN      0.000
    8  17  6.7000e-02   0.936   1.000        -Inf -7.1590e+00       0.000        NaN
    9   4  4.2012e+13   0.784   1.000        -Inf -5.3557e+15       0.000      0.000
   10  18  4.5000e-02   0.599   1.000        -Inf -7.5710e+00       0.000        NaN
   11  15  2.5200e-01   1.514   1.000        -Inf -1.6624e+01       0.000        NaN
   12   1  2.0000e-03   0.358   1.000        -Inf -6.2400e-01       0.000        NaN
   13  16 -3.0000e-02  -0.325   1.000  9.1210e+00         Inf         NaN      0.000
   14   7  6.0000e-03   0.117   1.000        -Inf -5.4510e+00       0.000        NaN
   15  12  6.0000e-03   0.074   1.000        -Inf -8.2730e+00       0.000        NaN
   16  20 -5.0000e-03  -0.052   1.000  9.7130e+00         Inf         NaN      0.000
   17  19 -5.1480e+13  -1.811   1.000  2.8423e+15         Inf       0.000      0.000
   18  13 -6.0000e-03  -0.099   1.000  6.2280e+00         Inf       0.000      0.000
   19  10  4.0000e-03   0.062   1.000        -Inf -5.7020e+00       0.000      0.000
   20   2  1.0000e-03   0.049   1.000        -Inf -1.9970e+00       0.000        NaN

Estimated stopping point from ForwardStop rule = 3
# Compute p-values and confidence intervals after AIC stopping - for selected
out.aic <- fsInf(fsfit, type = "aic", sigma = sigmahat)
out.aic

Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "aic")

Standard deviation of noise (specified or estimated) sigma = 0.348

Testing results at step = 4, with alpha = 0.100
 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
  14  0.298   6.178       1      -Inf   -4.816           0          0
   3 -0.046  -3.927       1     1.176      Inf         NaN          0
  11 -0.113  -2.314       1     4.901      Inf         NaN          0
   8 -0.132  -1.683       1     7.847      Inf         NaN          0

Estimated stopping point from AIC rule = 4
# Compute p-values and confidence intervals after 5 fixed steps
out.fix <- fsInf(fsfit, type = "all", k = 5, sigma = sigmahat)
out.fix

Call:
fsInf(obj = fsfit, sigma = sigmahat, k = 5, type = "all")

Standard deviation of noise (specified or estimated) sigma = 0.348

Testing results at step = 5, with alpha = 0.100
 Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
  14  0.298   6.178       1      -Inf   -4.816           0          0
   3 -0.046  -3.928       1     1.176      Inf         NaN          0
  11 -0.113  -2.307       1     4.902      Inf         NaN          0
   8 -0.134  -1.701       1     7.848      Inf           0          0
   5 -0.052  -1.101       1     4.683      Inf           0          0
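fsInf() labels variables by column index; mapping the active set back to names (taken from the colnames printout above) makes the stopping-rule output readable:

```r
# Column names of X_without_gradings_scaled, copied from the printout above
x_names <- c(
  "age_at_first_treatment_yrs", "modified_rankin_score_pretreatment",
  "max_size_cm", "is_maleFALSE", "is_maleTRUE", "is_eloquent_locationTRUE",
  "has_deep_venous_drainageTRUE", "is_diffuse_nidusTRUE",
  "has_hemorrhageTRUE", "has_seizuresTRUE", "has_deficitTRUE",
  "has_paresisTRUE", "has_associated_aneurysmTRUE", "is_surgeryTRUE",
  "procedure_combinationsER", "procedure_combinationsERS",
  "procedure_combinationsES", "procedure_combinationsR",
  "procedure_combinationsRS", "procedure_combinationsS"
)

# ForwardStop rule stops at step 3 (Vars 14, 3, 11)
x_names[c(14, 3, 11)]
# "is_surgeryTRUE" "max_size_cm" "has_deficitTRUE"

# AIC rule stops at step 4 (adds Var 8)
x_names[c(14, 3, 11, 8)]
```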

LASSO - all

Fit a LASSO logistic regression on all available variables, then compute post-selection p-values with fixedLassoInf().

# Set seed
set.seed(141845)

# Define values
X <- X_all
y <- y_all

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coefficients at the chosen lambda (the vector includes the
# intercept, which fixedLassoInf() expects for family = "binomial")
# (given the small number of observations, and hence unstable estimates,
# use the less conservative lambda.min rather than lambda.1se)
lambda <- best_lambda_min
n <- length(y) # y is a vector, so length(), not nrow()
beta <- coef(
  x = X,
  y = y,
  object = gfit,
  s = lambda,
  exact = TRUE
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Print model (robust to the case where no variable is selected)
if (length(out$vars) > 0) {
  tibble(
    names = names(out$vars),
    odds_ratio = exp(out$coef0),
    pvalue = out$pv,
    ci_lo = exp(out$ci[, 1]),
    ci_hi = exp(out$ci[, 2])
  ) %>%
    arrange(pvalue) %>%
    sable()
} else {
  print("No variables were selected")
}
names odds_ratio pvalue ci_lo ci_hi
has_deficitTRUE 0.47156 0.12472 0.22613 1.4425e+00
lawton_young_grade 0.78074 0.12708 0.11757 1.2831e+00
venous_drainageDeep 2.59309 0.14339 0.56626 5.4558e+00
max_size_cm 0.86935 0.23075 0.64812 1.2105e+00
is_surgeryTRUE 7.70326 0.23918 0.00148 1.0548e+08
procedure_combinationsERS 0.54856 0.37549 0.00000 3.0254e+04
locationDeep 0.56829 0.38849 0.24631 5.2677e+00
locationCorpus Callosum 5.61645 0.44418 0.00231 4.1000e+01
procedure_combinationsS 4.22752 0.57310 0.00000 8.1359e+03
procedure_combinationsES 2.34490 0.70317 0.00000 3.0948e+03
is_maleFALSE 1.51025 0.72140 0.00704 2.3201e+00
is_diffuse_nidusTRUE 0.79325 0.87127 0.51135 4.5430e+05
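The comment above motivates lambda.min over lambda.1se; the practical difference is the size of the active set. A self-contained sketch on simulated data (in the analysis above, the same comparison would use `cv_fit` with the real `X` and `y`):

```r
library(glmnet)

set.seed(42)
# Simulated logistic data: 150 observations, 10 predictors, 2 truly active
n <- 150
p <- 10
X_sim <- matrix(rnorm(n * p), n, p)
y_sim <- rbinom(n, 1, plogis(X_sim[, 1] - X_sim[, 2]))

cv <- cv.glmnet(X_sim, y_sim, family = "binomial", alpha = 1, nfolds = 10)

# lambda.1se applies at least as much shrinkage as lambda.min,
# so it typically selects the same or fewer variables
n_min <- sum(coef(cv, s = "lambda.min")[-1] != 0)
n_1se <- sum(coef(cv, s = "lambda.1se")[-1] != 0)
c(lambda.min = n_min, lambda.1se = n_1se)
```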

LASSO - without scores

Fit a LASSO logistic regression using the individual score components rather than the composite gradings.

# Set seed
set.seed(141845)

# Define values
X <- X_without_gradings
y <- y_without_gradings

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coefficients at the chosen lambda (the vector includes the
# intercept, which fixedLassoInf() expects for family = "binomial")
# (given the small number of observations, and hence unstable estimates,
# use the less conservative lambda.min rather than lambda.1se)
lambda <- best_lambda_min
n <- length(y) # y is a vector, so length(), not nrow()
beta <- coef(
  x = X,
  y = y,
  object = gfit,
  s = lambda,
  exact = TRUE
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Print model (robust to the case where no variable is selected)
if (length(out$vars) > 0) {
  tibble(
    names = names(out$vars),
    odds_ratio = exp(out$coef0),
    pvalue = out$pv,
    ci_lo = exp(out$ci[, 1]),
    ci_hi = exp(out$ci[, 2])
  ) %>%
    arrange(pvalue) %>%
    sable()
} else {
  print("No variables were selected")
}
names odds_ratio pvalue ci_lo ci_hi
max_size_cm 0.80106 0.01944 0.65747 9.5181e-01
has_deficitTRUE 0.45615 0.08523 0.23516 1.2026e+00
is_surgeryTRUE 8.17566 0.11092 0.11722 2.2923e+10
is_eloquent_locationTRUE 0.47130 0.20210 0.20176 2.1621e+00
procedure_combinationsERS 0.47423 0.21252 0.00000 1.7332e+02
is_diffuse_nidusTRUE 0.62299 0.43902 0.26334 6.6364e+00
has_hemorrhageTRUE 1.47309 0.51406 0.13455 2.6871e+00
procedure_combinationsS 3.73946 0.70483 0.00000 2.3150e+02
is_maleFALSE 1.37816 0.72311 0.01226 2.3395e+00
procedure_combinationsES 2.06932 0.81869 0.00000 6.8686e+01

LASSO - without components

Fit a LASSO logistic regression that uses the previous grading scales.

# Set seed
set.seed(141845)

# Define values
X <- X_with_gradings
y <- y_with_gradings

# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se

# Run glmnet
gfit <- glmnet(
  x = X,
  y = y,
  family = "binomial",
  alpha = 1,
  intercept = TRUE,
  standardize = TRUE
)

# Extract coef for a given lambda - avoid the intercept term
# (given the small number of observations hence large SD, use min lambda)
lambda <- best_lambda_min
n <- length(y)
beta <- coef(
  x = X,
  y = y,
  object = gfit,
  s = lambda,
  exact = TRUE
)

# Estimate sigma
gsigmahat <- estimateSigma(
  x = X,
  y = y,
  intercept = TRUE,
  standardize = TRUE
)$sigmahat

# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
  x = X,
  y = y,
  beta = beta,
  lambda = lambda,
  # Level of significance
  alpha = 0.1,
  family = "binomial",
  intercept = TRUE,
  sigma = gsigmahat
)

# Print model (robust to no variable being selected)
if (length(out$vars) > 0) {
  tibble(
    names = names(out$vars),
    odds_ratio = exp(out$coef0),
    pvalue = out$pv,
    ci_lo = exp(out$ci[, 1]),
    ci_hi = exp(out$ci[, 2])
  ) %>%
    arrange(pvalue) %>%
    sable()
} else {
  print("No variables were selected")
}
names odds_ratio pvalue ci_lo ci_hi
spetzler_martin_grade 0.57068 0.01298 0.29968 8.5037e-01
is_surgeryTRUE 11.64518 0.07239 0.64462 1.6585e+03
has_deficitTRUE 0.43738 0.12524 0.23325 1.4960e+00
is_diffuse_nidusTRUE 0.60971 0.24588 0.01107 4.5366e+00
procedure_combinationsERS 0.34890 0.33449 0.00336 3.3700e+01
locationCorpus Callosum 7.89072 0.40014 0.00332 5.1507e+01
has_hemorrhageTRUE 1.44913 0.44167 0.18339 4.7580e+00
is_maleFALSE 1.38509 0.55115 0.00451 4.9059e+01
procedure_combinationsS 2.07997 0.76280 0.00000 1.8295e+01
has_seizuresTRUE 0.75036 0.94815 0.96910 7.3753e+13

Multivariable - Without scores

Feature selection

Create predictors.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Add adjustment variables here if needed (none at present)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_without_gradings]

# Print
print(predictors)
[1] "max_size_cm"              "is_eloquent_location"     "is_diffuse_nidus"        
[4] "has_hemorrhage"           "has_deficit"              "has_deep_venous_drainage"
[7] "procedure_combinations"   "has_associated_aneurysm" 

Model selection

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
max_size_cm 7.9000e-01 0.10 -2.49629 0.013 0.65 9.5000e-01
procedure_combinationsMultimodal 1.2830e+01 1.11 2.30563 0.021 1.50 1.3568e+02
is_eloquent_locationTRUE 3.5000e-01 0.56 -1.89264 0.058 0.10 9.7000e-01
procedure_combinationsR 4.7500e+00 1.15 1.35326 0.176 0.51 5.4360e+01
has_hemorrhageTRUE 1.7200e+00 0.42 1.30855 0.191 0.77 3.9300e+00
has_deficitTRUE 6.4000e-01 0.40 -1.10472 0.269 0.29 1.4100e+00
is_diffuse_nidusTRUE 6.1000e-01 0.54 -0.90509 0.365 0.21 1.8100e+00
has_deep_venous_drainageTRUE 7.6000e-01 0.44 -0.61510 0.538 0.32 1.7900e+00
has_associated_aneurysmTRUE 8.2000e-01 0.51 -0.39268 0.695 0.31 2.2800e+00
procedure_combinationsS 3.0835e+08 1426.07 0.01371 0.989 0.00 4.9623e+224
(Intercept) 2.3000e+00 1.19 0.70056 0.484 0.20 2.5580e+01

Inference

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
procedure_combinationsS 3.0835e+08 0.80 24.58252 0.000 6.4896e+07 1.4652e+09
procedure_combinationsMultimodal 1.2830e+01 0.82 3.10270 0.002 2.5600e+00 6.4350e+01
max_size_cm 7.9000e-01 0.10 -2.38843 0.017 6.5000e-01 9.6000e-01
is_eloquent_locationTRUE 3.5000e-01 0.56 -1.89442 0.058 1.2000e-01 1.0400e+00
procedure_combinationsR 4.7500e+00 0.88 1.77739 0.076 8.5000e-01 2.6460e+01
has_hemorrhageTRUE 1.7200e+00 0.42 1.28156 0.200 7.5000e-01 3.9500e+00
has_deficitTRUE 6.4000e-01 0.38 -1.14702 0.251 3.0000e-01 1.3700e+00
is_diffuse_nidusTRUE 6.1000e-01 0.56 -0.87372 0.382 2.0000e-01 1.8400e+00
has_deep_venous_drainageTRUE 7.6000e-01 0.47 -0.57248 0.567 3.0000e-01 1.9200e+00
has_associated_aneurysmTRUE 8.2000e-01 0.49 -0.40306 0.687 3.1000e-01 2.1600e+00
(Intercept) 2.3000e+00 1.00 0.83504 0.404 3.3000e-01 1.6320e+01

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = F) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Predict

Predict.

# Predicted probabilities
predicted_probs <- predict(fit, type = "response")

Plot histogram of predictions.

tibble(
  Predictions = predicted_probs,
  Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
  pivot_longer(
    cols = everything(),
    names_to = "keys",
    values_to = "values"
  ) %>%
  mutate(keys = fct_inorder(keys)) %>%
  ggplot(aes(x = values, color = keys, fill = keys)) +
  geom_histogram(alpha = 0.7) +
  geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
  facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 12),
    strip.text.x = element_text(size = 13),
    legend.position = "none"
  )

Evaluate

How many examples were used?

# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)

# Print
sprintf(
  "
  All examples = %s (%s with outcome)
  Fitted examples = %s (%s with outcome)",
  num_all_examples,
  num_all_examples_positive,
  num_complete_cases,
  num_complete_cases_positive
) %>% cat()

  All examples = 225 (176 with outcome)
  Fitted examples = 223 (175 with outcome)

Residual analysis.

# Residuals
residuals <- residuals(fit, type = "deviance")

# Plot residuals
plot(residuals)

Pseudo R-squared.

# Calculate Pseudo R-squared values
pseudo_r2 <- pscl::pR2(fit)
fitting null model for pseudo-r2
print(pseudo_r2)
       llh    llhNull         G2   McFadden       r2ML       r2CU 
 -81.92613 -116.14411   68.43596    0.29462    0.26427    0.40837 

ROC-AUC.

# Compute
auroc <- cvAUC::ci.cvAUC(
  predictions = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME]
)

# Print
sprintf(
  "Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
  auroc$cvAUC,
  auroc$se,
  auroc$confidence * 100,
  auroc$ci[1],
  auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.847 (SE, 0.028; 95% CI, 0.792-0.902)"

PR-AUC.

# Construct object
precrec_obj_tr <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  calc_avg = F,
  cb_alpha = 0.05
)

# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
  precrec::part() %>%
  precrec::pauc() %>%
  sable()
modnames dsids curvetypes paucs spaucs
m1 1 ROC 0.84702 0.84702
m1 1 PRC 0.95436 0.95436

ROC and PR curves.

# Plot
precrec_obj_tr %>% autoplot()

Another version of the ROC curve.

# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)

# Plot ROC curve
plot(roc_curve)

# Calculate AUC
auc <- pROC::auc(roc_curve)
print(auc)
Area under the curve: 0.847

Plot all performance measures.

# Construct object
precrec_obj_tr2 <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  mode = "basic"
)

# Plot
precrec_obj_tr2 %>% autoplot()

Confusion matrix.

Kappa statistic: < 0 = no agreement; 0.00-0.20 = slight agreement; 0.21-0.40 = fair agreement; 0.41-0.60 = moderate agreement; 0.61-0.80 = substantial agreement; 0.81-1.00 = near-perfect agreement.

No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, the outcome group is the largest class and the NIR is 23/35 = 0.657. The larger the deviation of accuracy from the NIR, the more confident we can be that the model is doing more than simply predicting the largest class.
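A minimal sketch of the NIR arithmetic, using the illustrative 23-of-35 counts from the paragraph above (not this study's data):

```r
# No Information Rate = proportion of the largest observed class
# (illustrative counts only)
class_counts <- c(event = 23, no_event = 12)
nir <- max(class_counts) / sum(class_counts)
round(nir, 3)  # 0.657
```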

# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))

# Count values per category
xtab <- table(
  scores = pred,
  labels = truth
)

# Compute
confusion_matrix <-
  caret::confusionMatrix(
    xtab,
    positive = "Event",
    prevalence = NULL, # Provide prevalence as a proportion here if you want
    mode = "sens_spec"
  )

# Print
confusion_matrix
Confusion Matrix and Statistics

          labels
scores     Event No Event
  Event      165       27
  No Event    10       21
                                       
               Accuracy : 0.834        
                 95% CI : (0.779, 0.88)
    No Information Rate : 0.785        
    P-Value [Acc > NIR] : 0.04050      
                                       
                  Kappa : 0.436        
                                       
 Mcnemar's Test P-Value : 0.00853      
                                       
            Sensitivity : 0.943        
            Specificity : 0.438        
         Pos Pred Value : 0.859        
         Neg Pred Value : 0.677        
             Prevalence : 0.785        
         Detection Rate : 0.740        
   Detection Prevalence : 0.861        
      Balanced Accuracy : 0.690        
                                       
       'Positive' Class : Event        
                                       

Variable importance

Plot importance (based on magnitude of z-score).

# Calculate importance
varimp <-
  fit %>%
  caret::varImp() %>%
  as_tibble(rownames = "rn") %>%
  mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
  dplyr::select(Predictor, Importance = Overall)

# Plot variable importance
varimp %>%
  ggplot(aes(Predictor, Importance, fill = Importance)) +
  geom_bar(stat = "identity", alpha = 0.9, width = 0.65) +
  coord_flip() +
  guides(fill = F) +
  labs(y = "\nVariable Importance", x = NULL) +
  scale_fill_viridis_c() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.border = element_blank(),
    axis.text.x = element_text(margin = margin(t = 7), size = 12),
    axis.text.y = element_text(margin = margin(r = -10), size = 12),
    axis.title = element_text(size = 13)
  )
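For reference, caret::varImp() on a glm reports the absolute value of each coefficient's test statistic (the Wald z for a logistic model). A self-contained sketch of that equivalence on a toy logistic fit (mtcars, illustrative data only, not the study data):

```r
# Toy logistic fit (illustrative data)
fit_toy <- glm(am ~ mpg + wt, data = mtcars, family = binomial())

# caret::varImp()'s "Overall" for a glm is abs(z); compute it directly
z_importance <- abs(coef(summary(fit_toy))[, "z value"])
z_importance <- z_importance[names(z_importance) != "(Intercept)"]
sort(z_importance, decreasing = TRUE)
```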


Multivariable - Without components

Feature selection

Create predictors.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_unadjusted_recoded %>%
  bind_rows(univariable_adjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Add adjustment variables here if needed (none at present)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Print
print(predictors)
[1] "spetzler_martin_grade"   "is_diffuse_nidus"        "has_hemorrhage"         
[4] "location"                "has_deficit"             "procedure_combinations" 
[7] "has_associated_aneurysm" "is_surgery"             

Model selection

Fit logistic.

# Define data
df <- df_multi

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

fit_with_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
spetzler_martin_grade 5.0000e-01 0.24 -2.89399 0.004 0.31 7.9000e-01
procedure_combinationsMultimodal 1.9890e+01 1.34 2.23269 0.026 1.68 3.6527e+02
procedure_combinationsR 7.8800e+00 1.37 1.51158 0.131 0.62 1.4862e+02
locationDeep 2.7000e-01 0.88 -1.50208 0.133 0.03 1.2500e+00
has_hemorrhageTRUE 1.8000e+00 0.41 1.45140 0.147 0.82 4.0400e+00
locationHemispheric 2.9000e-01 0.86 -1.42659 0.154 0.04 1.3100e+00
has_deficitTRUE 6.2000e-01 0.41 -1.17787 0.239 0.27 1.3900e+00
is_diffuse_nidusTRUE 6.2000e-01 0.54 -0.87629 0.381 0.22 1.8200e+00
has_associated_aneurysmTRUE 7.4000e-01 0.50 -0.58705 0.557 0.28 2.0400e+00
procedure_combinationsS 4.2928e+08 1410.71 0.01409 0.989 0.00 2.8936e+218
is_surgeryTRUE
(Intercept) 6.6200e+00 1.48 1.27554 0.202 0.41 1.6071e+02

Inference

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
procedure_combinationsS 4.2928e+08 0.81 24.54652 0.000 8.7789e+07 2.0991e+09
procedure_combinationsMultimodal 1.9890e+01 0.84 3.56475 0.000 3.8400e+00 1.0299e+02
spetzler_martin_grade 5.0000e-01 0.23 -3.00684 0.003 3.2000e-01 7.9000e-01
procedure_combinationsR 7.8800e+00 0.91 2.27989 0.023 1.3400e+00 4.6460e+01
locationDeep 2.7000e-01 0.86 -1.53515 0.125 5.0000e-02 1.4400e+00
has_hemorrhageTRUE 1.8000e+00 0.40 1.46135 0.144 8.2000e-01 3.9700e+00
locationHemispheric 2.9000e-01 0.85 -1.45986 0.144 6.0000e-02 1.5300e+00
has_deficitTRUE 6.2000e-01 0.41 -1.17976 0.238 2.7000e-01 1.3800e+00
is_diffuse_nidusTRUE 6.2000e-01 0.53 -0.88779 0.375 2.2000e-01 1.7700e+00
has_associated_aneurysmTRUE 7.4000e-01 0.50 -0.58700 0.557 2.8000e-01 2.0000e+00
(Intercept) 6.6200e+00 1.07 1.77459 0.076 8.2000e-01 5.3380e+01

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = F) +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Predict

Predict.

# Predicted probabilities
predicted_probs <- predict(fit, type = "response")

Plot histogram of predictions.

tibble(
  Predictions = predicted_probs,
  Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
  pivot_longer(
    cols = everything(),
    names_to = "keys",
    values_to = "values"
  ) %>%
  mutate(keys = fct_inorder(keys)) %>%
  ggplot(aes(x = values, color = keys, fill = keys)) +
  geom_histogram(alpha = 0.7) +
  geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
  facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 12),
    strip.text.x = element_text(size = 13),
    legend.position = "none"
  )

Evaluate

How many examples were used?

# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)

# Print
sprintf(
  "
  All examples = %s (%s with outcome)
  Fitted examples = %s (%s with outcome)",
  num_all_examples,
  num_all_examples_positive,
  num_complete_cases,
  num_complete_cases_positive
) %>% cat()

  All examples = 228 (178 with outcome)
  Fitted examples = 223 (175 with outcome)

Residual analysis.

# Residuals
residuals <- residuals(fit, type = "deviance")

# Plot residuals
plot(residuals)

Pseudo R-squared.

# Calculate Pseudo R-squared values
pseudo_r2 <- pscl::pR2(fit)
fitting null model for pseudo-r2
print(pseudo_r2)
       llh    llhNull         G2   McFadden       r2ML       r2CU 
 -80.85057 -116.14411   70.58709    0.30388    0.27133    0.41928 

ROC-AUC.

# Compute
auroc <- cvAUC::ci.cvAUC(
  predictions = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME]
)

# Print
sprintf(
  "Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
  auroc$cvAUC,
  auroc$se,
  auroc$confidence * 100,
  auroc$ci[1],
  auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.847 (SE, 0.029; 95% CI, 0.791-0.904)"

PR-AUC.

# Construct object
precrec_obj_tr <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  calc_avg = F,
  cb_alpha = 0.05
)

# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
  precrec::part() %>%
  precrec::pauc() %>%
  sable()
modnames dsids curvetypes paucs spaucs
m1 1 ROC 0.84732 0.84732
m1 1 PRC 0.95422 0.95422

ROC and PR curves.

# Plot
precrec_obj_tr %>% autoplot()

Another version of the ROC curve.

# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)

# Plot ROC curve
plot(roc_curve)

# Calculate AUC
auc <- pROC::auc(roc_curve)
print(auc)
Area under the curve: 0.847

Plot all performance measures.

# Construct object
precrec_obj_tr2 <- precrec::evalmod(
  scores = predicted_probs,
  labels = fit$model[, CHOSEN_OUTCOME],
  mode = "basic"
)

# Plot
precrec_obj_tr2 %>% autoplot()

Confusion matrix.

Kappa statistic: < 0 = no agreement; 0.00-0.20 = slight agreement; 0.21-0.40 = fair agreement; 0.41-0.60 = moderate agreement; 0.61-0.80 = substantial agreement; 0.81-1.00 = near-perfect agreement.

No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, the outcome group is the largest class and the NIR is 23/35 = 0.657. The larger the deviation of accuracy from the NIR, the more confident we can be that the model is doing more than simply predicting the largest class.
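A minimal sketch of the kappa computation itself, using the counts from the confusion matrix printed below (rows = predicted class, columns = truth):

```r
# Cohen's kappa from a 2x2 table
# (counts copied from the confusion matrix below)
tab <- matrix(c(165, 10, 25, 23), nrow = 2,
              dimnames = list(pred = c("Event", "No Event"),
                              truth = c("Event", "No Event")))
n <- sum(tab)
p_obs <- sum(diag(tab)) / n                       # observed agreement
p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
kappa <- (p_obs - p_exp) / (1 - p_exp)
round(kappa, 3)  # 0.476, matching caret's output
```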

# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))

# Count values per category
xtab <- table(
  scores = pred,
  labels = truth
)

# Compute
confusion_matrix <-
  caret::confusionMatrix(
    xtab,
    positive = "Event",
    prevalence = NULL, # Provide prevalence as a proportion here if you want
    mode = "sens_spec"
  )

# Print
confusion_matrix
Confusion Matrix and Statistics

          labels
scores     Event No Event
  Event      165       25
  No Event    10       23
                                        
               Accuracy : 0.843         
                 95% CI : (0.789, 0.888)
    No Information Rate : 0.785         
    P-Value [Acc > NIR] : 0.0182        
                                        
                  Kappa : 0.476         
                                        
 Mcnemar's Test P-Value : 0.0180        
                                        
            Sensitivity : 0.943         
            Specificity : 0.479         
         Pos Pred Value : 0.868         
         Neg Pred Value : 0.697         
             Prevalence : 0.785         
         Detection Rate : 0.740         
   Detection Prevalence : 0.852         
      Balanced Accuracy : 0.711         
                                        
       'Positive' Class : Event         
                                        

Variable importance

Plot importance (based on magnitude of z-score).

# Calculate importance
varimp <-
  fit %>%
  caret::varImp() %>%
  as_tibble(rownames = "rn") %>%
  mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
  dplyr::select(Predictor, Importance = Overall)

# Plot variable importance
varimp %>%
  ggplot(aes(Predictor, Importance, fill = Importance)) +
  geom_bar(stat = "identity", alpha = 0.9, width = 0.65) +
  coord_flip() +
  guides(fill = F) +
  labs(y = "\nVariable Importance", x = NULL) +
  scale_fill_viridis_c() +
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.border = element_blank(),
    axis.text.x = element_text(margin = margin(t = 7), size = 12),
    axis.text.y = element_text(margin = margin(r = -10), size = 12),
    axis.title = element_text(size = 13)
  )


Multivariable - With interactions

Presents with hemorrhage

Fit logistic.

# Define data
df <- df_multi

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  "modified_rankin_score_pretreatment*has_hemorrhage +
  spetzler_martin_grade*has_hemorrhage +
  age_at_first_treatment_yrs*has_hemorrhage"
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
spetzler_martin_grade 0.38 0.25 -3.77844 0.000 0.22 0.61
modified_rankin_score_pretreatment:has_hemorrhageTRUE 0.55 0.36 -1.64993 0.099 0.26 1.10
modified_rankin_score_pretreatment 1.57 0.32 1.41656 0.157 0.86 3.05
has_hemorrhageTRUE 6.59 1.73 1.09284 0.274 0.22 204.87
age_at_first_treatment_yrs 1.05 0.06 0.84770 0.397 0.93 1.19
has_hemorrhageTRUE:age_at_first_treatment_yrs 0.95 0.10 -0.55731 0.577 0.78 1.15
has_hemorrhageTRUE:spetzler_martin_grade 1.08 0.36 0.22000 0.826 0.54 2.20
(Intercept) 18.21 1.17 2.48180 0.013 2.20 224.85

Model comparison

Likelihood ratio

Based on the likelihood-ratio test, there is no statistical evidence that the grading-based model fits the data better than the grading-independent model. Note that the two fitted models have the same residual degrees of freedom, so the comparison below has 0 df and no p-value is printed; the models are not nested, and the test should be interpreted with caution.

anova(fit_without_grading, fit_with_grading, test = "LR")
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       212     163.85
2       212     161.70  0   2.1511
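For reference, the LR statistic is simply the difference in residual deviances, referred to a chi-squared distribution with df equal to the difference in parameter counts. A sketch using the deviances above (the df = 1 is purely illustrative; the printed comparison actually has 0 df because both models use the same number of parameters):

```r
# Likelihood-ratio statistic from the residual deviances above
dev_reduced <- 163.85
dev_full <- 161.70
lr_stat <- dev_reduced - dev_full

# Compare against chi-squared with df = difference in parameter counts
# (df = 1 is illustrative only; the printed comparison has df = 0)
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
round(p_value, 3)
```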

Special cases

SM < 4

Select features.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Add adjustment variables here if needed (none at present)
predictors <- unique(c(predictors))

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Replace SM by its binary variant
predictors <- c(predictors, "is_spetzler_martin_grade_less_than_4")
predictors <- predictors[!predictors == "spetzler_martin_grade"]

# Print
print(predictors)
[1] "is_diffuse_nidus"                     "has_hemorrhage"                      
[3] "location"                             "has_deficit"                         
[5] "procedure_combinations"               "has_associated_aneurysm"             
[7] "is_spetzler_martin_grade_less_than_4"

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
procedure_combinationsMultimodal 1.4190e+01 1.19 2.23496 0.025 1.53 1.9091e+02
locationDeep 1.9000e-01 0.88 -1.91273 0.056 0.02 8.5000e-01
is_diffuse_nidusTRUE 4.2000e-01 0.52 -1.66154 0.097 0.15 1.1700e+00
is_spetzler_martin_grade_less_than_4TRUE 2.0700e+00 0.45 1.61798 0.106 0.85 5.0300e+00
procedure_combinationsR 6.3700e+00 1.22 1.51465 0.130 0.64 9.0260e+01
locationHemispheric 2.7000e-01 0.86 -1.50365 0.133 0.03 1.2000e+00
has_hemorrhageTRUE 1.8200e+00 0.40 1.48815 0.137 0.83 4.0400e+00
has_deficitTRUE 6.5000e-01 0.40 -1.07200 0.284 0.30 1.4400e+00
has_associated_aneurysmTRUE 7.1000e-01 0.49 -0.68499 0.493 0.28 1.9100e+00
procedure_combinationsS 1.9502e+08 878.36 0.02173 0.983 0.00 2.1708e+138
(Intercept) 7.3000e-01 1.29 -0.24738 0.805 0.06 1.0680e+01

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = F, conf.int = T) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
procedure_combinationsS 1.9502e+08 0.84 22.75592 0.000 3.7675e+07 1.0095e+09
procedure_combinationsMultimodal 1.4190e+01 0.83 3.19079 0.001 2.7800e+00 7.2400e+01
procedure_combinationsR 6.3700e+00 0.89 2.07584 0.038 1.1100e+00 3.6620e+01
locationDeep 1.9000e-01 0.93 -1.80418 0.071 3.0000e-02 1.1600e+00
is_diffuse_nidusTRUE 4.2000e-01 0.52 -1.67381 0.094 1.5000e-01 1.1600e+00
is_spetzler_martin_grade_less_than_4TRUE 2.0700e+00 0.45 1.60784 0.108 8.5000e-01 5.0400e+00
has_hemorrhageTRUE 1.8200e+00 0.40 1.47647 0.140 8.2000e-01 4.0100e+00
locationHemispheric 2.7000e-01 0.91 -1.43094 0.152 5.0000e-02 1.6200e+00
has_deficitTRUE 6.5000e-01 0.40 -1.07427 0.283 3.0000e-01 1.4300e+00
has_associated_aneurysmTRUE 7.1000e-01 0.49 -0.69166 0.489 2.8000e-01 1.8500e+00
(Intercept) 7.3000e-01 1.02 -0.31515 0.753 1.0000e-01 5.3100e+00

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = FALSE, conf.int = TRUE) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = "none") +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Had surgery

Select features.

# Define p-value threshold
pval_threshold <- 0.2

# Get the predictors of interest
predictors <-
  univariable_adjusted_recoded %>%
  bind_rows(univariable_unadjusted_recoded) %>%
  filter(`P-values` < pval_threshold) %>%
  pull(`Predictors`)

# Clean categorical variables
predictors <-
  predictors %>%
  str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
  unique()

# Keep each predictor once (no extra adjustment variables for this outcome)
predictors <- unique(predictors)

# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]

# Replace the procedure combinations with the binary surgery indicator
predictors <- c(predictors, "is_surgery")
predictors <- predictors[!predictors == "procedure_combinations"]

# Print
print(predictors)
[1] "spetzler_martin_grade"   "is_diffuse_nidus"        "has_hemorrhage"         
[4] "location"                "has_deficit"             "has_associated_aneurysm"
[7] "is_surgery"             
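
The `str_replace()` step above relies on `glm()`'s naming convention: a coefficient name is the snake_case variable name with the factor level appended in its original case, so stripping everything from the first upper-case character recovers the variable. A tiny base-R illustration (`sub()` behaves like `str_replace()` for a single match; the terms are examples from the tables above):

```r
terms <- c("locationDeep", "has_hemorrhageTRUE", "spetzler_martin_grade")
cleaned <- unique(sub("^([a-z0-9_]*)([A-Z].*)$", "\\1", terms))
cleaned
# "location" "has_hemorrhage" "spetzler_martin_grade"
```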

Fit logistic.

# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)

# Create formula
model <- as.formula(paste(
  CHOSEN_OUTCOME,
  "~",
  paste(predictors, collapse = " + ")
))

fit <- glm(
  model,
  data = df,
  family = binomial()
)

# Save
fit_without_grading <- fit

Print results.

# Summarize model coefficients
fit_summary <-
  fit %>%
  broom::tidy(exponentiate = T, conf.int = T) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
stylized <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
stylized %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
spetzler_martin_grade 5.5000e-01 0.23 -2.64812 0.008 0.35 8.4000e-01
locationDeep 2.7000e-01 0.85 -1.54021 0.124 0.04 1.2100e+00
has_hemorrhageTRUE 1.8100e+00 0.39 1.50991 0.131 0.84 3.9400e+00
locationHemispheric 4.1000e-01 0.83 -1.07502 0.282 0.06 1.7500e+00
has_deficitTRUE 6.6000e-01 0.40 -1.04939 0.294 0.30 1.4500e+00
is_diffuse_nidusTRUE 6.5000e-01 0.53 -0.81817 0.413 0.23 1.8500e+00
has_associated_aneurysmTRUE 7.1000e-01 0.47 -0.74099 0.459 0.28 1.8000e+00
is_surgeryTRUE 1.1434e+07 863.18 0.01883 0.985 0.00 3.6053e+130
(Intercept) 5.6000e+01 1.13 3.54785 0.000 7.54 6.9535e+02

Create robust CIs using the sandwich estimator.

# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")

# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)

Stylize results.

# Summarize model coefficients
fit_summary <-
  robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = FALSE, conf.int = TRUE) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  arrange(term == "(Intercept)", p.value) %>%
  rename(odds_ratio = estimate, z_value = statistic)

# Stylize and print
multivariable_pvalue <-
  fit_summary %>%
  rename(
    "Predictors" = term,
    "Odds Ratios (OR)" = odds_ratio,
    "SE" = std.error,
    "Z-scores" = z_value,
    "P-values" = p.value,
    "CI (low)" = conf.low,
    "CI (high)" = conf.high,
  ) %>%
  mutate(
    `Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
    `SE` = round(`SE`, 2),
    `CI (low)` = round(`CI (low)`, 2),
    `CI (high)` = round(`CI (high)`, 2),
    `P-values` = round(`P-values`, 3),
  )

# Print
multivariable_pvalue %>% sable()
Predictors Odds Ratios (OR) SE Z-scores P-values CI (low) CI (high)
is_surgeryTRUE 1.1434e+07 0.38 43.19075 0.000 5468706.59 2.3904e+07
spetzler_martin_grade 5.5000e-01 0.21 -2.89717 0.004 0.37 8.2000e-01
locationDeep 2.7000e-01 0.78 -1.67266 0.094 0.06 1.2500e+00
has_hemorrhageTRUE 1.8100e+00 0.39 1.51604 0.130 0.84 3.8700e+00
locationHemispheric 4.1000e-01 0.79 -1.13739 0.255 0.09 1.9100e+00
has_deficitTRUE 6.6000e-01 0.40 -1.03820 0.299 0.30 1.4500e+00
is_diffuse_nidusTRUE 6.5000e-01 0.51 -0.84556 0.398 0.24 1.7600e+00
has_associated_aneurysmTRUE 7.1000e-01 0.48 -0.71831 0.473 0.28 1.8200e+00
(Intercept) 5.6000e+01 1.16 3.48016 0.001 5.80 5.4050e+02
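
The `is_surgery` row (OR around 1.1e7) is the signature of quasi-complete separation: (almost) every non-surgical patient shares the same outcome, so the maximum-likelihood log-odds diverges, much like `procedure_combinationsS` in the previous model. A minimal reproduction on made-up data:

```r
# One predictor level with only one outcome value => separation
y <- c(rep(0L, 10), rep(1L, 10), rep(1L, 20)) # all 1s when x == 1
x <- rep(c(0, 1), each = 20)

fit_sep <- suppressWarnings(glm(y ~ x, family = binomial()))

# The slope and its model-based SE both blow up
coef(summary(fit_sep))["x", c("Estimate", "Std. Error")]
```

Robust SEs do not repair this; a penalized fit such as Firth's logistic regression (e.g. the `logistf` package) or reporting the crude cross-tabulation are the usual remedies.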

Plot the coefficients with their confidence intervals.

robust_ci %>%
  # broom does not support exponentiation after coeftest so do it manually
  broom::tidy(exponentiate = FALSE, conf.int = TRUE) %>%
  mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate)) %>%
  mutate(term = factor(term, levels = term)) %>%
  ggplot(aes(x = estimate, y = term, color = term)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
  labs(x = "Odds ratio", y = NULL) +
  guides(color = "none") +
  theme(
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 13)
  )

# Dotwhisker was not giving me the correct plot
# dotwhisker::dwplot(
#   vline = geom_vline(xintercept = 1, linetype = 2, alpha =  0.5),
#   show_intercept = F
# )

Grade 5s

df_uni %>%
  drop_na(spetzler_martin_grade) %>%
  count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
  mutate(
    pct_patients = scales::percent(num_patients / sum(num_patients), 1)
  ) %>%
  sable()
is_poor_outcome spetzler_martin_grade num_patients pct_patients
FALSE 1 38 17%
FALSE 2 45 20%
FALSE 3 65 29%
FALSE 4 37 16%
FALSE 5 17 7%
TRUE 1 2 1%
TRUE 2 3 1%
TRUE 3 6 3%
TRUE 4 7 3%
TRUE 5 8 4%

Only consider those with SM grade 5.

df_uni %>%
  filter(spetzler_martin_grade == 5) %>%
  count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
  mutate(
    pct_patients = scales::percent(num_patients / sum(num_patients), 1)
  ) %>%
  sable()
is_poor_outcome spetzler_martin_grade num_patients pct_patients
FALSE 5 17 68%
TRUE 5 8 32%
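
With only 25 grade-5 patients, the 32% poor-outcome rate deserves an interval. A quick exact (Clopper-Pearson) CI from base R, using the counts in the table above:

```r
# 8 poor outcomes among 25 SM grade-5 patients
ci <- binom.test(x = 8, n = 25)$conf.int
round(ci, 2)
```

The resulting interval is wide, a reminder that the grade-5 subgroup estimate is imprecise.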

Write

Setup

Create the necessary directories.

# Get today's date
today <- Sys.Date()
today <- format(today, "%Y-%m-%d")

# Create folder names
base_folder <- file.path(DST_DIRNAME, ANALYSIS_NAME)
date_folder <- file.path(base_folder, today)
csv_folder <- file.path(date_folder, "csv")
pdf_folder <- file.path(date_folder, "pdf")
html_folder <- file.path(date_folder, "html")

# Create folders
# Create folders (silently skip any that already exist)
for (folder in c(base_folder, date_folder, csv_folder, pdf_folder, html_folder)) {
  dir.create(folder, showWarnings = FALSE)
}

# Print folder names
print(base_folder)
print(date_folder)
print(csv_folder)
print(pdf_folder)
print(html_folder)
[1] "../../outputs//predictive-analytics/is-obliterated"
[1] "../../outputs//predictive-analytics/is-obliterated/2024-07-30"
[1] "../../outputs//predictive-analytics/is-obliterated/2024-07-30/csv"
[1] "../../outputs//predictive-analytics/is-obliterated/2024-07-30/pdf"
[1] "../../outputs//predictive-analytics/is-obliterated/2024-07-30/html"
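
The doubled slash in the printed paths is cosmetic: `DST_DIRNAME` ends in "/" and `file.path()` adds its own separator (POSIX systems treat "//" and "/" the same). If cleaner paths are wanted, trimming trailing slashes first is enough:

```r
dst <- sub("/+$", "", "../../outputs/") # drop the trailing slash
file.path(dst, "predictive-analytics", "is-obliterated")
# "../../outputs/predictive-analytics/is-obliterated"
```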

Figures

Write all figures.

# Save
# ggsave(
#   file.path(pdf_folder, "histogram_num_days.pdf"),
#   plot = plots$histogram_num_days,
#   width = 6,
#   height = 6
# )
# # Start graphic device
# pdf(
#   file = file.path(pdf_folder, "forest-plot_frequentist.pdf"),
#   width = 10,
#   height = 15
# )
#
# # Plot
# plots$forest_plot_frequentist
#
# # Shut down device
# dev.off()

Tables

Write all tables.

# # Arguments
# df <- tables$desc_stats_cohorts_cv_prolaio
# filepath_csv <- file.path(csv_folder, "desc-stats_by-cohort_cov.csv")
# filepath_html <- file.path(html_folder, "desc-stats_by-cohort_cov.html")
#
# # Save as CSV
# write_csv(
#   x = df,
#   file = filepath_csv
# )
#
# # Save as HTML
# # - Save pdf does not work with webshot
# # - It works with pagedown but not as pretty as desired
# df %>%
#   sable() %>%
#   kableExtra::save_kable(file = filepath_html)
write_csv(
  univariable_unadjusted,
  file.path(csv_folder, "univariable_unadjusted.csv")
)
write_csv(
  univariable_adjusted,
  file.path(csv_folder, "univariable_adjusted.csv")
)
write_csv(
  multivariable_pvalue,
  file.path(csv_folder, "multivariable_pvalue.csv")
)

Data

Write all data.

# # Arguments
# df <- study
# filename <- paste0(ANALYSIS_NAME, "_", today, ".csv")
# filepath_csv <- file.path(DST_BUCKET, dst_dirname_data, filename)
#
# # Save as CSV
# write_csv(
#   x = df,
#   file = filepath_csv
# )

Reproducibility

Linting and styling

# Style current file
styler::style_file(
  path = rstudioapi::getSourceEditorContext()$path,
  style = styler::tidyverse_style
)

# Lint current file
lintr::lint(rstudioapi::getSourceEditorContext()$path)

Dependency management

# Clean up project of libraries not in use
# (use prompt = FALSE to avoid the interactive session)
# (packages can only be removed in interactive mode b/c this is destructive)
renv::clean(prompt = TRUE)

# Update lock file with new packages
renv::snapshot()

Containerization

# Only run this if option is set to TRUE
if (UPDATE_DOCKERFILE) {
  # Create a dockerfile from the session info
  my_dockerfile <- containerit::dockerfile(from = sessionInfo(), env = ls())
  # Write file
  write(my_dockerfile, file = "~/Dockerfile")
  # Print
  print(my_dockerfile)
}

Documentation

Session Info

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;
LAPACK version 3.11.0

attached base packages:
[1] parallel stats graphics grDevices datasets utils methods base

other attached packages:
[1] rmarkdown_2.25 table1_1.4.3 shiny_1.8.0
[4] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[7] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[10] tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0
[13] selectiveInference_1.2.5 MASS_7.3-60 adaptMCMC_1.5
[16] coda_0.19-4.1 survival_3.5-7 intervals_0.15.4
[19] glmnet_4.1-8 Matrix_1.6-1.1 magrittr_2.0.3
[22] ggplot2_3.5.0 kableExtra_1.4.0 rmdformats_1.0.4
[25] knitr_1.45

loaded via a namespace (and not attached):
[1] splines_4.3.2 later_1.3.2 versions_0.3
[4] R.oo_1.26.0 hardhat_1.4.0 pROC_1.18.5
[7] rpart_4.1.21 rex_1.2.1 lifecycle_1.0.4
[10] tcltk_4.3.2 globals_0.16.3 processx_3.8.3
[13] lattice_0.21-9 vroom_1.6.5 backports_1.4.1
[16] sass_0.4.8 jquerylib_0.1.4 yaml_2.3.8
[19] remotes_2.4.2.1 httpuv_1.6.14 summarytools_1.0.1
[22] pkgload_1.3.4 R.cache_0.16.0 R.utils_2.12.3
[25] styler_1.10.2 nnet_7.3-19 sandwich_3.1-0
[28] ipred_0.9-14 lava_1.8.0 listenv_0.9.1
[31] parallelly_1.37.1 svglite_2.1.3 codetools_0.2-19
[34] xml2_1.3.6 tidyselect_1.2.1 precrec_0.14.4
[37] shape_1.4.6.1 futile.logger_1.4.3 farver_2.1.1
[40] matrixStats_1.3.0 stats4_4.3.2 base64enc_0.1-3
[43] jsonlite_1.8.8 caret_6.0-94 e1071_1.7-14
[46] ellipsis_0.3.2 Formula_1.2-5 iterators_1.0.14
[49] systemfonts_1.0.5 foreach_1.5.2 tools_4.3.2
[52] pryr_0.1.6 Rcpp_1.0.12 glue_1.7.0
[55] gridExtra_2.3 prodlim_2023.08.28 xfun_0.42
[58] withr_3.0.0 formatR_1.14 BiocManager_1.30.22
[61] fastmap_1.1.1 fansi_1.0.6 callr_3.7.5
[64] digest_0.6.34 lintr_3.1.1 timechange_0.3.0
[67] R6_2.5.1 mime_0.12 colorspace_2.1-0
[70] R.methodsS3_1.8.2 utf8_1.2.4 generics_0.1.3
[73] renv_1.0.4 data.table_1.15.4 recipes_1.0.10
[76] class_7.3-22 stevedore_0.9.6 ModelMetrics_1.2.2.2
[79] pkgconfig_2.0.3 gtable_0.3.4 timeDate_4032.109
[82] lmtest_0.9-40 containerit_0.6.0.9004 htmltools_0.5.7
[85] bookdown_0.39 scales_1.3.0 cyclocomp_1.1.1
[88] gower_1.0.1 lambda.r_1.2.4 rstudioapi_0.15.0
[91] tzdb_0.4.0 reshape2_1.4.4 checkmate_2.3.1
[94] nlme_3.1-163 curl_5.2.1 ggcorrplot_0.1.4.1
[97] proxy_0.4-27 zoo_1.8-12 cachem_1.0.8
[100] miniUI_0.1.1.1 desc_1.4.3 pillar_1.9.0
[103] grid_4.3.2 vctrs_0.6.5 pscl_1.5.9
[106] promises_1.2.1 shinyFiles_0.9.3 xtable_1.8-4
[109] evaluate_0.23 cvAUC_1.1.4 magick_2.8.3
[112] cli_3.6.2 compiler_4.3.2 futile.options_1.0.1
[115] rlang_1.1.3 crayon_1.5.2 future.apply_1.11.2
[118] labeling_0.4.3 ps_1.7.6 plyr_1.8.9
[121] fs_1.6.3 stringi_1.8.3 pander_0.6.5
[124] viridisLite_0.4.2 assertthat_0.2.1 munsell_0.5.0
[127] lazyeval_0.2.2 rapportools_1.1 hms_1.1.3
[130] bit64_4.0.5 future_1.33.2 highr_0.10
[133] ROCR_1.0-11 semver_0.2.0 broom_1.0.6
[136] memoise_2.0.1 bslib_0.6.1 bit_4.0.5
